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The  focus  of  research  was  the  application  of  systolic  array  architectures  to  cornpytations  in  numerical 
linear  algebra,  and  the  applications  of  these  architectures  to  digital  signal  processing  (DSP)  and  elliptic  partial 
differential  equations  (PDE).  Research  was  conducted  on  a  number  of  topics!  first  I  will  discuss  those  for 
which  complete  reports  have  been  written. 

C  ‘ 

c-r  Matrixlriangularization  by  systolic  arrays,* 

/•'"Earlier  work  by~Gent!emernmd~Kung  was  extended;  many  practical  questions  concerning 

/  application  to  problems  in  DSP  were  discussed  [1).  The  literature  on  factorization  of  banded 

1  matrices  was  unified  and  extended  [6]. 

^♦Singular  value  and  eigenvalue  computations^ 

A  new  architccture  for  singalarvaluc  decomposition  (SVD)  was  developed  [3].  I  also  considered 
the  use  of  a  proposed  systolic  architecture  for  eigenvalue  and  SVD  when  the  matrix  is  too  large  for 
the  systolic  array  to  accommodate  (4,7). 

^•Elliptic  PDEj' 

/  TIrcrdesigirof  a  highly  parallel  architecture  for  the  multigrid  method  and  an  analysis  of  its 
performance  was  given  in  joint  work  with  T.  Chan  of  Yale  [2,8]. 

')  •  Updating  Cholesky  factorizations.  C  - - 

'  The  problem  is  to  recompute  the  Cholesky  Factorization  (  A  =  LLT  )  of  a  symetric  positive 
definite  matrix  when  it  is  changed  by  a  matrix  of  low  rank.  This  arises  often  in  DSP  and  also  in 
quasi-Newton  methods  in  optimization.  1  described  several  systolic  architectures  in  joint  work 
with  my  student,  W.P.  Tang  [5]. 

Several  projects  were  begun,  and  work  continues  on  these.  During  my  stay  at  the  Royal  Institute  of 
Technology  in  Stockholm  I  began  work  with  Dainis  Millars  on  construction  of  systolic  arrays  with  custom 
VLSI  and  off-the-shelf  VLSI  components.  A  report  is  forthcoming;  a  patent  for  the  custom  chip  will  be 
sought 

Work  also  began  in  Stockholm,  with  Erik  Tidcn  and  Bjorn  Lisper,  on  the  synthesis  and  verification  of 
systolic  arrays. 

With  Lars  Eldcn  of  Linkoping  University,  I  developed  a  systolic  architecture  for  linear,  discrete  ill- 
posed  problems.  The  report  will  appear  early  in  1984. 

Earlier  work  of  mine  on  systolic  methods  for  the  eigenvalue  problem  raised  the  issue  of  pipelining 
iterations  of  the  QR  algorithm.  The  difficulty  in  doing  so  is  in  finding  a  suitable  shift  strategy.  A  student, 
W.  Wilson,  has  begun  some  numerical  experiments;  his  results,  unfortunately,  arc  not  encouraging.  A  report 
will  appear  in  1984. 


Finally,  work  has  begun  on  the  implementation  of  several  modem  high  resolution  direction-finding 
algorithms  in  DSP. 
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A  Systolic  Architecture 
for  Singular  Value  Decomposition 

Vne  architecture  systolique 
pour  la  decomposition  en  valeurs  singuliires 

R.  Schreiber 
(Stanford,  USA) 


1  INTRODUCTION 

Systolic  arrays  are  highly  parallel  computing 
structures  specific  to  particular  computing  tasks. 
They  are  well-suited  for  reliable  and  inexpensive 
Implementation  using  many  identical  VLSI  compon¬ 
ents.  The  designs  consist  of  one  and  two-dimen¬ 
sional  lattices  of  identical  processing  elements. 
Communication  of  data  occurs  only  between  neigh¬ 
boring  cells.  Control  signals  propagate  through 
the  array  like  data.  These  characteristics  make 
it  feasible  to  construct  very  large  arrays. 

Several  modern  methods  in  digital  signal  pro¬ 
cessing  require  real-time  solution  of  some  of  the 
basic  problems  of  linear  algebra  [13).  Fortunately 
systolic  arrays  have  been  developed  for  many  of 
these  problems  [4,10,12].  But  several  gaps  remain. 
Only  partially  satisfying  results  have  been  obtain¬ 
ed  for  the  eigenvalue  and  singular  value  decomposi¬ 
tions,  for  example. 

Here  we  consider  a  systolic  array  for  the  sing¬ 
ular  value  decomposition  (SVD).  An  SVD  of  an  m  x  n 
(m  21  n)  matrix  A  is  a  factorization 


we  discuss  the  implementation  using  VLSI  chips  of 
these  systolic  eigenvalue  and  SVD  arrays. 


The  SVD  is  often  used  to  regularize  ill-condi¬ 
tioned  problems.  In  these  there  are  p  <  n  large 
singular  values  and  n-p  that  are  much  smaller. 

What  is  needed  is  the  pseudoinverse  of  the  rank  p 
matrix  closest  (with  respect  to  the  2-norm)  to  A, 


\p)  ‘  U1  °l  VI  + 


+  u  o 
P  P 


V 


T 
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We  have  recently  developed  a  new  algorithm  to  com- 
A^pj  that  involves  nothing  but  a  sequence  of  mat¬ 
rix-matrix  products,  for  which  systolic  arrays  are 
well-known  (see,  e.g.,  [9].)  An  alternate  form  of 
Che  algorithm  can  be  used  to  compute  the  related 
orthogonal  projection  matrix 


P(P>  "  V1  Vl  + 
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2  AN  SVD  ARCHITECTURC 

Let  A  be  a  given  matrix.  The  singular  values 
of  A  will  be  obtained  in  two  phases: 


A  •  t  !  VT 

where  V  is  m  x  n  with  orthonormal  columns,  l  ■ 
diag(o j ,  J2,  ....  on)  with  0j  >  a2  >  ...  »  on> 

and  V  is  orthogonal.  There  are  many  Important 
applications  of  the  SVD  [1,6,13). 


1.  A  is  reduced  to  an  upper  triangular 
matrix  B  with  bandwidth  k+1, 

b, .  -  0  if  i  ■  j  or  1  •  j-k, 
ij 

and  B  *  QAP  where  Q  and  P  are  orthogonal. 


There  have  been  several  earlier  investigations 
of  parallel  SVD  algorithms  and  arrays.  First, 

Flnn^  Luk,  and  Pottle  describe  a  systolic  structure 
of  n-/2  processors  and  two  algorithms  that  use  it. 
But  the  convergence  of  their  algorithms  has  not 
been  proved  and  may  be  slow  [3).  Heller  and  Ipsen 
[81  describe  an  array  for  computing  the  singular 
values  of  a  banded  matrix  with  bandwidth  w.  They 
use  0(w)  processors  and  O(wn')  time.  Brent  and 
Luk  (2)  describe  an  n/2  processor  linear  array 
that  implements  a  one-sided  orthogonallzatlon 
method  and  converges  reliably  in  0(n  log  n)  time. 
Unfortunately  the  processors  in  this  array  are 
quite  complex,  and  It  is  not  clear  that  matrices 
with  more  than  n  columns  can  be  efficiently 
accomodated . 

In  this  paper  we  discuss  two  topics.  First, 
we  show  how  an  architecture  for  computing  the 
eigenvalues  of  a  symmetric  matrix  can  be  modified 
to  compute  singular  values  and  vectors.  Second, 


2.  B  is  diagonalized  by  an  Iterative  process 
equivalent  to  |mpltcitly  shifted  QR 
Iteration  on  B  B. 

With  k-1  this  is  the  standard  method  of  Colub 
and  Relnsch  [7).  The  reason  for  allowing  k'l  is  an 
Increase  in  the  parallelism.  In  phase  1,  kn  proc¬ 
essors  are  employed;  the  time  is  0(mn/k).  In 
phase  2,  2k2  processors  are  used;  the  time  per  It¬ 
eration  is  6n+0(k). 

2. 1  Reduction  to  banded  form 

The  reduction  step  uses  a  k  x  n  trapezoidal 
array  that  has  been  described  in  detail  previously 
(121.  Let  the  m  x  n  matrix  X  be  partitioned  as 


t.O.S.  (ULLITIN  Ot  LA  OMICTION  Oft  ITUOIt  IT  MCHMCMII 
•Mil  C  -  MATHIMATWUII.  (NSOHMATtOUl  if  I.  M3  X.  143.141 
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where  is  k  x  k.  The  array  applies  a  sequence 
of  Givens  rotations  to  the  rows  of  X  to  zero  the 
first  k  columns  below  the  main  diagonal.  If  Q  is 
the  product  of  these  rotations,  then 


Q^  are  orthogonal. 

T 

First  we  consider  QR  iteration  on  B  B  without 
shifts.  This  can  be  realized  by  Che  procedure 


QX 


R11  Y12 

o  y22 


1. 


Find  such  that 

L(i)  „  BU)Q(i) 


where  is  k  x  k  upper  triangular.  R^,  ^12* 

and  the  parameters  of  the  rotations  that  make 
up  Q  all  flow  out  from  the  array.  The  time  requir¬ 
ed  is  m.  (Here  and  below  we  give  ''times*'  in  units 
of  the  time  required  for  an  individual  cell  in  the 
array  to  carry  out  its  computation.) 


is  lover  triangular, 

2.  Find  such  that 

B(i+n  „  p(i)L(i) 

is  upper  triangular. 


Now  let 


,  A11  A12 
A  3 

A->1  A22 

-  — 

be  the  given  matrix.  Send  A  through  the  array  to 
produce 


V 


R11  C12 


n 


'22 

Next  send  [CY.,,  C^l 


through  to  produce 


P1 ^C12"  C22l  *  fL12’  A22 ^ 


(Although  the  input  matrix  has  m  columns,  the  array 
can  handle  this  factorization  in  time  o  by  making 
m/n  passes  over  the  data  1121.  Now  continue  this 
process  using  A22  in  place  of  A.  After  Jifn/k] 

such  steps  we  have  produced  a  k+1  diagonal,  upper 
triangular  matrix  B, 


Rl,l  Ll,2 


R2,2  L2,3 


B  - 


J-l.J 


J.J 


such  that  A  *  QBP  where  Q  and  P  are  orthogonal. 
The  total  time  used  is  mJ  -  mn/k. 


The  transposition  of  data  required  can  be  done 
by  a  specialized  switching  device,  a  "systolic 
shifter,"  described  earlier  [12]. 


Both  steps  of  this  procedure  can  be  carried  out  by 
the  Heller-lpsen  (HI)  array  [8).  This  is  a  k  x  w 
rectangular  array  for  QR  factorization  of  w-diag- 
onal  matrices.  In  this  array,  plane  rotations  are 
generated  at  the  left  edge  and  move  to  the  right, 
affecting  a  pair  of  matrix  rows.  Take  w  =  k+1. 

B^  enters  the  matrix  at  the  bottom,  each  diagonal 
entering,  one  element  at  a  time,  into  one  of  the 
processors.  The  array  annihilates  the  elements  of 
the  upper  triangle  of  Bw.  This  causes  fill-in 
of  k  diagonals  in  the  lower  triangle.  The  result¬ 
ing  matrix  Ll  emerges  from  the  top  in  the  same 
diagonal-per-processor  format.  It  immediately 
enters  a  second  array.  This  array  annihilates  the 
lower  triangle  of  1(0  and  the  resulting  upper  tri¬ 
angular  matrix  bU+1)  emerges  from  the  top  (Fig.  1) 

The  time  is  2n+4k  per  iteration:  element  a  enters 

nn 

the  bottom  array  at  time  2n,  leaves  at  the  upper 
left  corner  at  time  2n+2k,  and  leaves  the  top  array 
at  time  2n+4k. 

Unshifted  QR  converges  slowly.  The  rate  of 
convergence  of  b  to  is  ■  1°  some  situ¬ 

ations  this  may  Be  adequate  and  the  simplicity  of 
the  structure  used  is  then  a  real  advantage. 

It  is  also  easy  to  pipeline  the  iterations  As 
g(i+l)  comes  out  of  the  second  array  it  can  be  sent 
directly  into  another  pair  of  arrays  to  begin  the 
(i+l)th  iterations,  etc.  As  many  as  n/4k  itera¬ 
tions  can  be  effectively  pipelined;  any  more  and 
the  pipe  length  exceeds  n,  so  that  the  pipe 
never  gets  full.  If  we  choose  k  *  il(n^“)  and 
pipeline  n/4k  *  Ofn1'^)  iterations  thsn  the  number 
of  processors  in  both  arrays  is  0(nJ/')  and  the 
total  time,  assuming  0(n)  iterations  of  OR  are 
required,  is  also  0(n^'-).  These  considerations 
also  apply  to  the  array  Implementation  of  the 
implicitly  shifted  QR  algorithm  that  is  discussed 
below,  with  one  important  proviso.  When  pipelin¬ 
ing  the  iterations,  some  strategy  for  choosing 
several  shifts  in  advance  must  be  used. 


When  singular  vectors  are  to  be  computed,  the 
rotations  generated  by  the  array  may  be  applied  to 
identity  matrices  of  order  m  and  n.  This  can  be 
done  bv  the  array.  These  matrices  accumulate  the 
product  of  the  rotations  used,  that  is  the  ortho¬ 
gonal  matrices  Q  and  P  above. 

2 . 2  QR  iteration 

Now  we  consider  QR  iteration  to  get  the  singu¬ 
lar  values  of  B,  hence  those  of.A,  We  shall  gener¬ 
ate  a  sequence  of  matrices  {B1  1 )  having  the  same 
structure  as  B  and  converging  to  a  diagonal  matrix. 

B(n)  -  B  and  B(1+1)  -  P(l,B(1)Q(l)  where  P(l)  and 


2.2.1  Implicitly  shifted  QR  iteration 

To  obtain  adequate  convergence  speed  we  need  to 
incorporate  shifts.  Following  Stewart  [1-4],  sup¬ 
pose  that  one  QR  iteration  with  shift  *  is  per¬ 
formed  on  B*B,  and  the  orthogonal  matrix  so  gene¬ 
rated  is  Q.  Then  proceed  as  follows: 

1.  Let  Q«  be  any  matrix  whose  first  k  columns 
are  tne  same  as  those  of  Q; 

2.  Using  the  same  technique  as  in  Section  2.1, 
reduce  BQ^  to  upper  triangular  k+1  diag¬ 
onal  form,  yielding  a  matrix  B*. 
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T 

Ic  can  be  shown  chat  B'  B'  Is  the  matrix  that  would 
result  from  one  QR  step  with  shift  1  applied  to 
B~B. 


To  use  the  trapezoidal  array  as  described  above 
to  carry  out  step  2  would  be  inefficient.  Rather 
we  proceed  as  follows.  Qq  is  composed  of  plane 
rotations  that  zero  the  first  k  columns  of  (B^B-X) 
below  the  main  diagonal.  Applying  Qq  to  B  causes 
fill-in  in  the  k  diagonals  below  the  main  diagonal, 
confined  to  rows  2,  3,  ....  2k.  See  Fig.  2  for  the 
case  k=2. 


Figure  1 

Unshlf ted  QR  Iteration  with  two  Heller- 
Ipsen  arrays 


xxx 
x  x  x  x 

X  X  X  X  X 

0  X  X  X  X  X 

0  0  0  0  x  x  x 


Figure  2 

Structure  of  BQq  ,  k-2 


Let  the  first  2k  rows  of  BQq  be  sent  into  a 
k  x  2k+l  HI  array.  By  a  sequence  of  plane  rota¬ 
tions  applied  to  the  rows,  the  array  removes  the 
"bulge”  In  the  lower  triangle,  adding  a  bulge  of 
the  same  shape  In  the  first  3k  columns  of  the  upper 
triangle.  This  data  flows  directly  Into  another 
k  x  2k-*- 1  HI  array  that  removes  the  elements  to  the 
right  of  the  k1*1  superdlagonal  and  causes  a  new 
bulge  to  appear  in  the  lower  triangle.  In  columns 
k*T  through  3k— 1  and  extending  to  row  3k.  (The 
second  HI  array  Is  the  mirror  Image  of  the  first. 


Rotations  are  generated  at  its  right  edge  and  move 
left,  affecting  pairs  of  matrix  columns.  Let 
and  Q1  be  the  orthogonal  matrices  implicitly  used 
by  the  two  HI  arrays.  The  matfllx  emerging  from  the 
second  array  is 


B1  ■  piBVi 

and  it  has  the  form 


11 
0  B 


21 


where  is  a  k  x  n  upper  triangular,  k+1  diagonal 
matrix,  and  B  is  an  n-k  x  n-k  matrix  of  the  same 
form  as  BQq.  Pig-  3  illustrates  this  for  k-2. 


time 
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Figure  3 

"Chasing  the  bulge"  with  two  k  x  2k-*- 1 
Heller-Ipsen  arrays 


Sow  we  do  exactly  the  same  thing  to  B.,^,  etc. 
This  yields  matrices 


B. 

1 


Vi. mV 


1-3. 
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with 


"J.j 

<-°  Vi.jJ 


and  J  »  f"(n-l)/k7.  Finally  B' 
is  the  matrix  we  require. 


The  time  neede  is  6n.  It  takes  2k  steps  for  an 
HI  array  to  start  producing  output.  Thus,  the  sec¬ 
ond  array  starts  its  output  at  time  4k.  The  first 
element  of  B^+^  which  is  the  (k+l)st  element 

of  the  main  diagonal  to  come  out  of  the  second 
array,  comes  out  at  time  6k.  By  this  time  the 
first  arrays  inputs  have  become  idle,  so  this 
element  can  immediately  reenter.  Therefore  one 
step,  from  B^  to  takes  time  6k.  There  are 

pCn-l)/kJ  such  steps,  hence  about  6n  time  for  the 
whole  process. 

2. 3  Complex  matrices 

In  signal  processing  applications,  complex 
matrices  often  arise.  Here  we  discuss  the  algor¬ 
ithms  to  be  used  for  QR  iteration  with  complex 
matrices.  Essentially  we  show  that  the  plane  rota¬ 
tions  used  can  be  of  a  special  form: 
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where  x,  y,  and  c  are  complex  and  o  is  real.  This 
saves  1/4  of  the  multiplications  used  by  a  fully 
complex  plane  rotation  with  complex  s  instead  of 
a  —  12  are  used  instead  of  16.  We  shall  call 
these  c ,o  rotations. 

It  is  possible  to  compute  the  SVD  of  a  complex 
m  x  n  matrix  A  ■  Ar+1Aj  using  real  arithmetic.  One 
finds  the  SVD  of  the  2m  x  2n  real  matrix 


Among  the  2n  singular  values  each  singular  value  of 
A  occurs  twice,  and  the  singular  vectors  are  of  the 
form  [xj,  xp  where  x  »  xR  +  lxT  is  a  singular 
of 
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But  the  cost  is  much  greater.  In 


vector  of~  A. 
units  where  the  cost  of  doing  an  m  x  n  real  SVD  is 
one,  the  cost  for  the  real  2m  x  2n  SVD  is  8  while 
the  complex  m  x  n  approach  costs  3  (not  4,  since 
the  use  of  the  c,o  rotations  saves  1/4  of  the  work). 

We  now  show  that  the  c,o  rotations  suffice. 

To  start,  we  note  that  the  banded  matrix  B  produced 
by  the  reduction  phase  can  always  be  chosen  to  have 
positive  real  elements  on  its  main  and  k**1  super- 
diagonals.  Indeed  the  reduction  B  *  QAP  to  k+1 
diagonal,  upper  triangular  form  is  not  unique: 
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is  also  such  a  reduction  for  any  unitary  diagonal 
matrices  Dj  and  D^.  These  can  always  be  chosen  to 
give  B  the  stated  property.  In  fact,  the  trap¬ 
ezoidal  array  can  do  this  automatically  [12).  When 
it  generates  a  rotation  to  zero  some  matrix  ele¬ 
ment,  the  second  element  of  the  pair  (x,y)  for 
instance,  it  chooses  the  paramenters  so  that  the 
result  of  the  rotation  is  the  pair 
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Furthermore,  the  elements  to  be  zeroed  are  the  real 
elements  resulting  from  previous  rotations.  The 
rotations  to  do  the  zeroing  can,  for  this  reason, 
be  taken  to  be  c,o  rotations. 

Now  we  look  at  the  second  phase.  Because  of 
the  structure  of  B,  the  main,  k1*1  super  and  k**1  sub- 
diagonals  of  BTB  are  all  real.  The  rotations  that 
comprise  Qq  can  be  taken  to  be  c,o  rotations  since 
they  zero  real  elements.  And  by  keeping  track  of 
the  locations  of  real  elements  one  can  show  that  in 
BQq  all  elements  of  the  outer  diagonals  are  real. 
Again  because  the  elements  to  be  annihilated  are 
real,  c,o  rotations  can  be  used  to  eliminate  the 
bulge.  A  matrix  with  the  same  structure  as  BQ 
results,  and  the  proof  therefore  follows  by  induc¬ 
tion. 


2.4  An  alternate  scheme 

Gene  Golub  has  pointed  out  that  the  eigenvalues 
of  the  2n  x  2n  matrix 
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are  the  singular  values  of  A  taken  with  positive  and 
negative  sign,  and  if  (x^,  y^)  is  an  eigenvector  of 
C  then  x  is  a  left  singular  vector  of  B  and  y  is  a 
right  singular  vector  of  B  [5],  Thus  we  may  attempt 
to  find  the  eigendecomposition  of  C.  After  a  sym¬ 
metric  Interchange  of  rows  and  columns  corresponding 
to  the  permutation  (n+1,  2,  n+2,  2,  ....  2n,  n) , 

C  is  a  symaetric  4k-l  diagonal  matrix.  A  2k-l  x 
4k-l  HI  array  can  implement  one  step  of  the  QR 
method  with  shifts  for  this  matrix  in  n  +  0(k) 
time  [10],  In  the  complex  case,  both  C  and  the 
permuted  C  have  real  outermost  diagonals,  so  c,o 
rotations  can  be  used.  Thus,  although  twice  as  much 
hardware  is  used,  the  time  per  iteration  is  1/6  as 
great  as  for  the  previous  scheme. 


3  VLSI  IMPLEMENTATION 

Now  we  consider  how  to  build  the  cells  of  the 
HI  array.  The  fundamental  unit  we  use  in  this  con¬ 
struction  is  a  multiply-add  cell,  whose  function  is 
this: 
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Outputs  leave  the  cell  one  clock  after  inputs  enter. 

Although  other  primitive  units  (CORDIC  blocks, 
for  example)  might  be  used,  we  feel  that  the  mult¬ 
iply-add  is  a  good  basis  for  such  an  investigation. 
Currently,  a  floating  point  multlplv-add  is  about 
what  can  be  Integrated  on  a  single  chip.  It  is 
almost  universally  useful.  Indeed,  the  multiply- 
add  pair  is  often  the  inner  loop  in  numerical  linear 
algebraic  computations.  Even  when  larger  cells  and 
pieces  of  arrays  can  be  integrated  into  single 
chips,  designs  based  on  the  multiply-add  primitive 
will  be  useful. 

We  shall  discuss  Implementation  of  the  HI  arrav 
cells  for  complex  data.  The  real  case  was  discussed 
earlier  [11)  as  were  the  cells  of  the  trapezoidal 
array  (12). 
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The  complex  HI  array  triangularizes  a  banded 
input  matrix  using  c,o  rotations  of  the  form  (1). 
The  rotations  are  applied  to  a  pair  (xty)  of  matrix 
elements  by  an  internal  cell 
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after  having  been  generated  by  a  boundary  cell 
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In  the  internal  cell  computation,  4  quantities 
are  computed,  each  requiring  3  multiplies  and  2 
adds.  Let  zR  and  denote  the  real  and  imaginary 
parts  of  the  complex  quantity  z.  The  computed 
values  are 


XR  ^  CR*R  +  CIXI  *  °V 
xi  1  CRX1  *  C1XR  '  ayi: 

yR  *  CRyR  *  Ciyl  +  0XR> 
yI  *  CRyi  +  CIyR  +  JXI- 


Using  4  multlply-add  chips  we  can  construct  a 
compound  cell  that  gives  these  results  in  the  least 
possible  time,  3  clocks.  We  assume  that  complex 
quantities  are  represented  in  "word  serial”  form, 
with  the  real  part  preceding  the  imaginary  part  on 
the  same  data  path.  A  schedule  using  4  chips  that 
achieves  the  minimum  latency  is  shown  in  Table  1. 


Table  1.  Schedule  for  Internal  HI  Cell 
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The  computation  at  the  boundary  cell  is  ♦his; 
given  inputs  x  and  compute 


A  second  primitive,  for  divide  and  square  root,  is 
needed  to  implement  the  boundary  cell.  We  assume 
that  a  chip  for  computing 

(a,b)  — ■>  a  /  b1^2 

is  available.  A  compound  cell  using  one  multiply' 
add  and  two  of  these  square  root  chips  can  produce 
results  at  the  rate  required  to  keep  up  with  the 
internal  cell.  A  schedule  is  shown  in  Table  2. 

The  overall  array  timing  is  now  that  of  the  "ideal” 
HI  array  in  which  everything  happens  in  a  single 
cycle  (of  length  3  chip  clocks)  .  The  cells  are 
used  1/2  of  the  time,  but  two  independent  problems 
can  be  solved  simul taneously,  making  full  use  of 
the  hardware. 


Table  2.  Schedule  for  HI  Boundary  Cell 
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Abstract 


We  describe  and  analyse  a  family  of  highly  parallel  special  purpose  computing  networks  that 
implement  multi-grid  algorithms  for  solving  elliptic  difference  equations.  The  networks  have 
many  of  the  features  and  advantages  of  systolic  arrays.  We  consider  the  speedup  achieved  by 
these  designs  and  bow  this  is  affected  by  the  choice  of  algorithm  parameters  and  the  level  cf 
parallelism  employed.  We  find,  for  example,  that  when  there  is  one  processor  per  grid-point.  tLe 
designs  cannot  avoid  suffering  a  loss  of  efficiency  as  the  grid-size  tends  to  zero. 

1.  Introduction 

We  shall  describe  and  analyse  a  family  of  highly  parallel  special-purpose  computing  network* 
that  implement  multi-grid  algorithms  for  solving  elliptic  difference  equations.  These  networks 
have  the  same  characteristics  •  regularity,  local  comm  uncial  ion,  and  repetitive  use  of  a  single, 
simple  processing  element  •  that  make  systolic  architectures  attractive  [9].  These  architectural 
advantages  make  it  possible  to  build  large  computing  networks  of  VLSI  cells  that  would  be 
relatively  cheap,  reliable,  and  very  powerful. 
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Both  a  basic  and  a  fall  multi-grid  algorithm  are  considered.  The  basic  method  reduces  the 
error  in  a  given  initial  approximation  by  a  constant  factor  in  one  iteration.  The  full  method 
requires  no  initial  guess  and  produces  a  solution  with  error  proportional  to  the  truncation  error  of 
the  discretization.  These  algorithms  are  representative  of  many  variants  of  linear  and  nonlinear 
multi-grid  algorithms. 

The  analysis  assumes  that  we  are  solving  a  linear  system  originating  in  a  discretization  of  an 
elliptic  partial  differential  equation  on  a  rectangle  in  Rd,  using  a  regular  nd  point  grid.  The 
network  is  a  system  of  grids  of  processing  elements.  For  each  l  <  k  <  K,  processor  grid  Pk  has 
(nk)'T  elements,  where  7  is  an  integer  less  than  or  equal  to  d,  and  nK  —  n.  The  machine 
implements  a  class  of  multi-grid  algorithms  using  a  corresponding  system  of  nested  point  grids. 
For  each  1  <  k  <  K,  point  grid  Gk  has  (nk)d  points.  The  key  assumption,  which  is  quite 
realistic,  is  that  it  takes  this  machine  0((nk)<**'r)  time  to  carry  out  the  computation  required  by 
one  step  of  the  multi-grid  algorithm  on  point  grid  Gk  using  processor  grid  Pk> 

We  shall  consider  the  efficiency  of  these  parallel  implementations,  defining  efficiency  to  be  the 
ratio  of  the  speedup  achieved  to  the  number  of  processors  employed  [8].  We  shall  consider  a 
design  to  be  efficient  if  this  ratio  remains  bounded  by  a  positive  constant  from  below  as  n  —  00. 
The  analysis  will  show  that  when  7  <  d  some  algorithms  can  be  efficiently  implemented.  But 
when  7  »  d  (this  is  the  most  parallelism  one  can  reasonably  attempt  to  use)  no  algorithm  can  be 
efficiently  implemented.  There  does  exist,  in  this  case,  one  group  of  algorithms  for  which  the 
efficiency  falls  off  only  as  (log  n)Tl. 

The  analysis  assumes  that  we  implement  the  same  algorithms  used  by  uniprocessor  systems. 
Convergence  results  for  these  algorithms  have  been  rather  welt-developed  recently  [l,  5,  7].  We 
make  no  attempt  to  develop  algorithms  that  exhibit  concurrent  operation  on  several  grids.  Note, 
however,  that  some  encouraging  experimental  results  with  such  an  algorithm  have  been  obtained 
recently  by  Gannon  and  Van  Rosendale  [6]. 

In  any  discussion  of  the  practical  use  of  a  specialized  computing  device,  it  must  be 
acknowledged  that  overspecialization  can  easily  make  a  design  useless.  At  least,  the  designed 
device  should  be  able  to  solve  a  range  of  size  of  problems  of  a  particular  structure,  perhaps 
solving  large  problems  by  making  several  passes  over  the  data,  solving  a  sequence  of  smaller 
subproblems,  or  with  some  other  techniques.  We  shall  consider  how  a  large  grid,  with  (mn)^ 
points,  can  be  handled  by  a  system  of  processor  grids  with  n1  elements  each  having  Ofm^n**7) 
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memory  cells.  Problems  on  nonrectangular  domains  can  be  handled  by  techniques  requiring 
repeated  solutions  on  either  rectangular  subdomains  or  containing  domains. 

Brandt  [3]  has  also  considered  parallel  implementations  of  multi-grid  methods.  He  discusses  the 
use  of  various  interconnection  networks  and  appropriate  smoothing  iterations.  One  of  our  results, 
a  (log  n)2  time  bound  for  fully  parallel,  full  multi-grid  algorithms,  is  also  stated  in  his  paper. 

2.  Multi-Grid  Algorithms 

We  shall  consider  multi-grid  algorithms  in  a  general  setting.  The  continuous  problem  is 
defined  by  the  triple  {H,  a(u,v),  f(v)},  where  H  is  a  Hilbert  space  with  a  norm  ||.||,  a(u,v)  is  a 
continuous  symmetric  bilinear  form  on  H  x  H,  and  f(r)  :  H  —  R  is  a  continous  linear  functional. 
The  problem  is: 

Find  u  €  H  such  that  a(u,v)  —  f(v)  for  all  v  6  H.  (1) 

It  can  be  shown  that  if  a(  V)  satisfies  certain  regularity  conditions  (for  example,  that  a(v,v)  >  cQ 
||v||2  for  all  v  6  H),  then  Problem  (1)  has  a  unique  solution  [4]. 

We  consider  finite  dimensional  approximations  of  Problem  (1).  Let  M.,  j  >  1,  be  a  sequence  of 
N-*<iimensional  spaces,  on  which  one  can  define  a  corresponding  bilinear  form  a-(u.v)  and  a 
corresponding  continuous  linear  functional  fj(v),  which  are  constructed  to  be  approximations  to 
a(u,v)  and  f(v)  respectively.  Also,  since  the  multi-grid  algorithms  involve  transferring  functions 
between  these  spaces,  we  have  to  construct  extension  (interpolatory)  operators  Ej  :  M-  j  —  M-. 

We  shall  give  two  multi-grid  algorithms,  namely  BASICMG  and  FULLMG,  with  FULLMG 
calling  BASICMG  in  its  inner  loop.  The  two  algorithms  differ  in  that  BASICMG  starts  its 
computation  on  the  finest  grid  and  works  its  way  down  to  the  coarser  grids,  whereas  FULLMG 
starts  with  the  coarsest  grid  and  works  its  way  up  to  the  finest  grid.  In  the  conventional  single 
processor  case,  BASICMG  reduces  the  error  on  a  certain  grid  by  a  constant  factor  in  optimal 
time,  whereas  FULLMG  reduces  the  error  to  truncation  error  level  in  optimal  time. 

We  give  the  basic  multi-grid  algorithm  BASICMG  in  Table  (2-1).  This  is  a  recursive 
algorithm,  although  in  practice  it  is  usually  implemented  in  an  iterative  fashion.  The  iterations 
are  controlled  by  the  predetermined  parameters  (c.j,m).  In  this  sense  it  is  a  direct  method,  unlike 
related  adaptive  algorithms  which  control  the  iterations  by  examining  relative  changes  in  the 
residuals  [2].  Figure  (2-1)  illustrates  the  iteration  sequence  in  the  case  c  »  2.  The  major 
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computational  work  is  in  the  smoothing  sweeps  (subroutine  SMOOTH),  which  usually  consists  of 
some  implementation  of  the  successive-over- relax  at  ion  or  Jacobi  iteration  or  the  conjugate 
gradient  method.  The  smoothing  sweeps  are  used  to  annihilate  the  highly  oscillatory  (compared 
to  the  grid  spacing)  components  of  the  error  in  z  efficiently.  We  require  that  a  suitably 
”  parallel"  method,  Jacobi  or  odd-even  SOR  for  example,  be  used  as  the  smoother.  In  the  next 
section,  we  shall  discuss  new  architectures  for  implementing  these  smoothing  operations  in  more 
efficient  ways. 

Table  2-1:  BASICMG  Algorithm 
Algorithm  BASlCMG(k,iteJ,m>ak,fk) 

<  Computes  an  approximation  to  uk  €  Mk, 
where  aJc(u^,v)  «—  ffc(v)  for  all  v  €  Mk, 
given  an  initial  guess  z  €  Mk. 

Returns  the  approximate  solution  in  i. 

Reduces  initial  error  in  z  by  a  constant  faetor.> 

If  k  «—  1  then 

Solve  the  problem  using  a  direct  method.  Return  solution  z. 

else 

<  Smoothing  step  (j  sweeps):  > 

z  <-  SMOOTH(j,z,ak,fk). 

<Form  coarse  grid  correction  equation:  > 

*h(R’t)  “  \-i  for  4,1  r  €  Mk.,. 

< Solve  coarse  grid  problem  approximately  by  c  cycles  of  BASICMG  :> 
qe-O. 

Repeat  c  times: 

BASICMG(  k- 1  ,q,c,  j,m  .ajj.  j  ,bk  ^ ) 

<  Correction  step:> 

i  ^  i  +  Ekq. 

< Smoothing  step  (m  sweeps):  > 
z  m  SMOOTH( m,z,ak,fk). 

End  If 

End  BASICMG 

We  give  the  full  multi-grid  algorithm  FULLMG  in  Table  (2-2).  In  the  BASICMG  algorithm, 
the  choice  of  initial  guess  for  uk  is  not  specified.  In  practice,  good  initial  guesses  are  sometimes 
available  essentially  free  (for  example,  from  solutions  of  a  nearby  problem,  from  solutions  at  a 
previous  time  step,  etc.).  The  FULLMG  algorithm  interpolates  approximate  solutions  on  coarser 


Figure  3-1:  Iteration  of  BASICMG  for  k  ■  3  end  e  ™  2 


Level 

Coe  rse  1  ds  ds 

/  \  /  \ 

2  j  m*j  a 

/  \ 

Fine  3  j  * 

ds:  Direct  Solves  j.m  :  number  of  smoothing  sweeps. 


grids  as  intial  guesses  for  the  BASICMG  algorithm.  It  is  also  recursive  and  non- adaptive.  Figure 
(2-2)  illustrates  the  iteration  sequence  in  the  case  k  —  3,  c  ■»  2  and  r  «  1. 

Table  3-2:  FULL  MG  Algorithm 
Algorithm  FULLMGtkiS^yrteJtm,^,^) 

< Computes  an  approximation  *.  to  «k  6  Mk 

where  ajt(uk,r)  —  f.  for  all  v  6  Mfc, 

using  r  iterations  of  BASICMG, 

using  initial  guess  from  interpolating  the  approximate 

solution  obtained  on  the  next  coaser  grid. 

Solution  obtained  can  be  proven  to  have  truncation  error  accuracy. > 

If  k  *■  1  then 

Solve  the  problem  using  a  direct  method  to  get  iy 

else 

<  Obtain  solution  on  next  coarser  grid:> 

FULLMG(k-l,sk.l^,C4,m,ak.I,fk>l). 

interpolate  ^ 

*k  ~  Ek  *k.r 

< Reduce  the  error  by  iterating  BASICMG  r  times:  > 

Repeat  r  times: 

BASICMG(k,tk?c,j,m,ak,fk). 

End  if 

End  FULLMG 

We  would  like  to  summarise  briefly  the  accuracy  and  convergence  behaviour  of  the  above  two 
multi-grid  algorithms.  Since  the  main  emphasis  of  this  paper  is  on  the  algorithmic  aspects  of 
these  multi-grid  algorithms,  we  shall  refer  the  reader  to  the  literature  for  more  details.  The 
framework  presented  here  is  based  on  the  work  of  and  Bank  and  Dupont  [1]  and  Douglas  [5]. 
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Figure  3*2:  Iteration  of  FULLMG  for  k  ■  3,  e  ■  2  end  r«I 
Level 

1  ds  d*  ds  ds 

\  /  \  /  \  /  \ 

2  j  *  j  ■♦j  * 

\  /  \ 

3  j  ft 

ds:  Direct  Solves  j.ft  :  number  of  smoothing  sweeps. 

The  accuracy  and  the  convergence  of  the  BASICMG  algorithm  obviously  depend  on  the  three 
crucial  steps  of  the  algorithm:  smoothing,  coarse  grid  transfer,  and  fine  grid  correction.  The 
basic  requirements  are  that  the  smoothing  sweeps  annihilate  the  high  frequency  components  of 
the  error  efficiently,  the  coarse  grid  correction  q  be  a  good  approximation  to  the  fine  grid  error  in 
the  low  frequency  components,  and  the  interpolation  operators  (Ej's)  be  accurate  enough.  These 
conditions  can  be  formalised  into  mathematically  precise  hypotheses  which  can  then  be  verified 
for  specific  applications  ($].  Assuming  these  hypotheses,  one  can  show  that  Algorithm  BASICMG 
reduces  the  error  on  level  k  by  a  constant  factor  provided  that  enough  smoothing  sweeps  are 
performed.  Moreover,  it  can  be  shown  (see  Section  4)  that  Algorithm  BASICMG  (for  small 
values  of  c)  can  achieve  this  in  optimal  time,  i.e.  0(Nk)  arithmetic  operations.  Obviously,  the 
work  needed  depends  on  the  accuracy  of  the  initial  guess  and  increases  with  the  level  of  accuracy 
desired.  Often,  one  is  satisfied  with  truncation  error  accuracy,  i.e.  ||x  •  u||  ■»  0(||uk  *  u||)  <  C 
Nk*4  for  some  fixed  $  and  C  which  are  independent  of  k.  For  a  general  initial  guess,  the 
straightforward  application  of  Algorithm  BASICMG  to  reduce  the  initial  error  to  this  level  takes 
0(Nklog(Nk))  time,  which  is  not  optimal.  The  FULLMG  algorithm  overcomes  this  problem  by 
using  accurate  intis)  gnesses  obtained  by  interpolating  solutions  from  coarser  grids.  The 
convergence  result  for  Algorithm  BASICMG  can  be  combined  with  the  basic  approximation 
properties  of  the  various  finite  dimensional  approximations  {M-,  a.,  f.}  to  show  that  Algorithm 
FULLMG  computes  a  solution  xk  that  has  truncation  error  accuracy  in  0(Nk)  time. 


Coarse 


Fine 
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3.  The  Computing  Network 

In  this  section  we  describe  a  simple  parallel  machine  design  for  multi-grid  iteration.  We  restrict 
attention  to  linear  elliptic  problems  in  d  dimensions  over  rectangular  domains,  to  discretizations 
based  on  grids  of  nd  points,  and  to  multi-grid  methods  based  on  a  system  of  point-grids  { G  k }  f 
where  Gk  has  (nk)d  gridpoints,  with  mesh  lengths  h^.,  1  <  j  <  d,  the  finest  grid  has  nK  —  n, 
and 

nk+l  —  a  (nk  +  1)  -  1,  k  —  1,  2, ...  ,  K-l 
for  some  integer  a  >  2. 

The  machine  consists  of  a  system  of  processor-grids  { Pk}  £jtjo  responding  to  the  point-grids. 
Each  processor-grid  is  an  (nkP  lattice  in  which  a  processor  is  connected  to  its  2*y  nearest 
neighbors. 

We  shall  employ  a  standard  multi-index  notation  for  gridpoints  and  processors.  Let 
tj  ®  {0»  L  — t  n*l}. 

Let  s+f  an  (sjj)*,  the  set  of  s-tuples  of  nonnegative  integers  less  than  n.  We  shall  make  use  of  a 
projection  operator  :  »+f  —  sjjj^  defined  for  r  >  s  by 

ir))  —  (iI,  i,)- 

By  convention,  if  L  €  s^,  then  L“  (i,,  ....  if).  Also  let  L  —  (1,  ...  ,  1).  We  shall  also  use  the 
norm  |jj  m  jijl  +  ...  +  |ij  on  nj,. 

We  shall  label  the  gridpoints  in  Gk  with  elements  of  s*  d  in  such  a  way  that  the  point  with 
label  j.has  special  coordinates  (i,!^  ,,  ijhk^,  ....  idhk  d).  Similarly,  we  label  processors  in  Pk  with 
indices  in  a? 

Thus,  processors  Land  k  are  connected  if  |L-  kl  —  1.  In  order  to  make  the  machine  useful  for 
problems  with  periodic  boundary  conditions,  we  might  also  add  "wrap-around”  connections,  so 
that  L  and  k  are  connected  if  KL*  k)  nod  n|  «  1.  In  Section  3.1,  it  is  shown  that  periodic 
problems  ean  also  be  handled  without  theee  connections. 

Evidently,  if  each  processor  has  Ofo*7)  memory  ceils,  we  can  store  the  solution,  forcing 
function,  and  0(1)  temporary  values  belonging  to  the  whole  of  grid  Gk  in  the  processors  of  Pk; 
we  store  gridpoint  Lin  processor  jr](j)  for  L€  s* d. 

With  the  given  connectivity,  smoothing  sweeps  of  some  types  can  be  accomplished  in  Ofn*'7) 
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time.  It  is  not  necessary  for  the  stencil  of  the  difference  scheme  to  correspond  to  the  connectivity 
of  the  processor  grid.  Jacobi  or  odd/even  SOR  smoothing  can  be  so  implemented,  for  example. 
Let  t  be  the  time  taken  by  a  single  processor  to  perform  the  operations  at  a  single  gridpoint  that, 
done  over  the  whole  grid,  constitute  a  smoothing  sweep.  If  S  is  the  time  to  implement  a 
smoothing  sweep  over  the  whole  of  grid  Gk  on  processor- grid  Pk,  then 

S  -  t  nkd-'» .  (2) 

Grid  Pk  is  connected  to  grid  Pk+l-  Processor  i.  €  Pk  is  connected  to  processor  a(i+l)  •  l_  6 
Pk  r  These  connections  allow  the  inter-grid  operations  (forming  coarse  grid  forcing  terms  bk 
and  interpolation  Ek)  to  also  be  computed  in  O(S)  time.  We  refer  to  the  system  of  processor- 
grids  (Pt . Pj}  as  the  machine  Lj  for  J  —  I,  2 . K. 

The  execution  of  the  BASICMG  iteration  by  Lk  proceeds  as  follows. 

1.  First,  j  smoothing  steps  on  grid  Gk  are  done  by  Pk.  All  other  processor^grids  are 
idle. 

2.  The  coarse  grid  equation  is  formed  by  Pk  and  transferred  to  Pk+r 

3.  The  c  cycles  of  BASICMG  on  grid  k-1  are  performed  by  L^j. 

4.  The  solution  q  is  transferred  to  Pk  and  interpolation  Ekq  is  performed  by  Pk- 

5.  The  remaining  m  smoothing  steps  are  done  by  Pk. 

Let  W(n)  represent  the  time  needed  for  steps  1,2,4  and  5.  Then 

W(n)  —  (j  +  m  +  s)  t  n*"1  (3) 

where  s  is  the  ratio  of  the  time  required  to  perform  steps  2  and  4  to  the  time  needed  for  one 
smoothing  sweep.  Note  that  s  is  independent  of  n,  d  and  7. 

The  natural  way  to  build  such  a  machine  is  to  embed  the  7  -»  1  machine  in  two  dimensions  as 
a  system  of  communicating  rows  of  processors,  the  7  —  2  machine  in  three  dimensions  as  a 
system  of  communicating  planes,  etc.  Of  course,  realisations  in  three-space  are  possible  for  any 
value  of  7-  Gannon  and  Van  Rosendale  [0]  consider  the  implementation  of  the  fully  parallel 
machine  (7  “  d)  on  proposed  VLSI  and  Multi-Microprocessor  system  architectures. 

This  design  differs  from  systolic  array  designs  in  that  there  is  no  layout  with  all  wire  lengths 
equal.  But  for  reasonably  large  machines  the  differences  in  wire  length  should  not  be  so  great  as 
to  cause  real  difficulties.  Moreover,  one  need  not  continue  to  use  ever  eoarser  grids  until  a  Ixl 
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grid  is  retched.  Ia  practice,  3  or  4  levels  of  grids  could  be  used  and  most  of  the  multi-grid 
efficiency  retained;  this  would  make  the  construction  much  simpler. 


3.1.  Solwing  Larger  Problem* 

Suppose  there  are  (mn)d  gridpoints  and  only  a1  processors.  Assume  that  each  processor  can 
store  all  information  associated  with  m^n*'*  gridpoints.  Now  we  map  gridpoints  to  processors  in 
such  a  way  that  neighboring  gridpoints  reside  in  neighboring  processors.  To  do  this  we  define  a 
mapping  fm:  a+n  —  a*,  such  that,  for  all  i,  j  €  |fm(i)  •  fm(j)|  <  |i  *  j|,  as  follows.  Let  j  — 
qn  +  r  where  q  and  r  are  integers,  0  <  r  <  a-1.  Now  let 


f.u>  e 


(r  if  q  is  even 
n-l-r  if  q  is  odd. 


Now  if  m  is  even,  then  fm(0)  «  fn(mn-l)  —  0,  so  that  periodic  boundary  conditions  can  be 
handled  without  any  "wrap-around”  connections.  This  operation  corresponds  to  folding  a  piece 
of  paper  in  a  fan-like  manner;  for  m  —  10,  for  example,  like  this: 

AAAAA 


To  map  a  multidimensional  structure  we  fold  it  as  above  in  each  coordinate.  Let  the  processor- 
grid  have  nd  elements  and  the  point-grid  have  n^n  x  m^n  x  ...  x  mdn  points.  Point  i_  can  be 
stored  in  processor  F^mj,  ...,  md,  n;  i)  where  Fd(i)  ■  (fm  (ij),  ...»  If  have  only  a1 

processors  then  we  map  i.»nto  F^'U  where  FJ(i)  an  F^jtJO). 


4.  Complexity 

In  this  section,  we  are  going  to  analyse  the  time  complexity  of  the  two  multi-grid  algorithms, 
BASICMG  and  FULLMG,  as  implemented  by  the  different  architectures  just  discussed.  It  turns 
out  that  the  complexity  of  the  two  algorithms  is  rerj  similar.  Since  the  BASICMG  algorithm  is 
simpler  and  is  called  by  FULLMG,  we  shall  discuss  and  analyse  it  first.  After  that,  we  shall 
indicate  how  to  derive  the  results  for  FULLMG. 


4.1.  Complexity  of  BASICMG 

To  simplify  the  analysis,  we  shall  assume  that  the  computational  domain  is  a  rectangular 
parallelopiped  and  is  diacretited  by  a  hierarchy  of  cartesian  grids  (corresponding  to  the  Mj's)  each 
with  Uj  mesh  points  on  each  side  (denoted  the  n--grid).  Further,  we  assume  that  the  n^'s  satisfy 
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n.  —  a  (Oj.i+1)  -  1  where  t  is  u  integer  bigger  then  one.  Generally  ,  we  denote  by  T(n)  the 
time  complexity  of  the  BASICMG  Algorithm  on  an  n-grid.  By  inspecting  the  description  of 
Algorithm  BASICMG,  it  is  not  difficult  to  see  that  T(n)  satisfies  the  following  recurrence: 


T(an)  -  c  T(n)  +  W(an),  (4) 

where  W(an)  denotes  the  work  needed  to  preprocess  and  postprocess  the  (an)- grid  iterate  before 
and  after  transfer  to  the  coarser  n-grid.  We  have  the  following  general  result  concerning  the 
solution  of  (4),  the  proof  of  which  is  elementary. 


Lemma  It  Let  Tp  be  a  particular  solution  of  (4),  i.e. 

Tp(an)  -  c  Tp(n)  +  W(an), 
then  the  general  solution  of  (4)  is: 

T(n)  ■■  a  n10***  +  Tp(n),  where  a  is  an  arbitrary  constant. 


(5) 


(0) 


The  term  W(an)  includes  the  smoothing  sweeps,  the  computation  of  the  coarse  grid  correction 
equation  (i.e.  the  right-hand-side  b^)  and  the  interpolation  back  to  the  fine  grid  (Ekq).  The 
actual  time  needed  depends  on  the  architecture  used  to  implement  these  operations  (specifically 
the  dimensionality  of  the  domain  and  the  number  of  processors  available  on  an  n-grid).  In 
general,  as  derived  in  Section  3,  W(n)  is  given  by 


W(n)  —  (j+m+s)  t  g(n)  ■  0  g(n) ,  (7) 

where  g(n)  ■■  np  with  p  m  d -y  . 

In  Table  (4-1),  we  give  the  form  of  the  function  g(n)  as  a  function  of  the  architecture  and  the 
dimensionality  of  the  domain.  We  also  give  a  bound  on  the  total  number  of  processors  (P) 
needed  to  implement  the  architecture  and  note  that  it  is  always  the  same  order  as  the  number  of 
processors  on  the  finest  grid.  For  a  d-dimensional  problem  with  n7  processors  on  the  n-grid,  we 
have 

f  1  if  7  -  0, 

P(7)  -  { 

l  (a7/(a7-l))  u7  if  7  >  0. 

We  have  the  following  general  result  for  this  class  of  functions  g(n). 

Lemma  3:  If  W(n)  “  $  np,  then  we  can  take  the  following  as  particular  solution  of 
(4): 

f  0  (ap/(ap-c))  np  if  p  yd  log  c, 

Tp(n)  -  (8) 

'  jfnp  logtn  if  p  “  logte. 
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Tnblo  4-lt  Table  of  g(n) 


Architecture  1 

1-0 

1 

2-D 

1 

3-0 

1 

Tots)  •  of  processors 

1 

1 

1 

1 

on  a  1 i  grids.  P 

1  processor  1 

n 

1 

n2 

1 

a3 

1 

1  1 

n  processors  1 

1 

1 

n 

1 

a2 

1 

Ca/Ca-l))  n 

n2  processors! 

- 

1 

1 

1 

n 

1 

(a2/(s2-l))  n2 

n3  processors) 

- 

1 

- 

1 

1 

1 

(a3/ (a3-!))  n3 

Note:  Architecture  colusn  gives  nueber  of  processors  on  the  n-grid. 

Combining  the  results  of  the  last  two  lemmas,  we  arrive  at  our  main  result: 

Theorem  3 1  The  solution  of  (4)  for  W|n)  —  0  np  satisfies: 

$  (ap/(ap-c))  np  +  Ofn10**')  ifc  <  ap, 

T(n)  -"I  np  log%n  +  0(np)  if  c  ■■  ap,  (9) 

CKn1**.')  if  c  >  ap. 

Note  that  in  the  last  case,  a  yL  0  (in  Equation  (6))  because  Tp(n)  does  not  satisfy  the  boundary 
conditions. 

For  the  first  two  cases  (c  <  ap),  we  can  determine  the  highest  order  term  of  T(n)  explicitly. 
However,  for  the  case  c  >  ap,  the  constant  in  the  highest  order  term  depends  on  the  initial 
condition  of  the  recurrence  (4)  (i.e.  the  time  taken  by  the  direct  solve  on  the  coarsest  grid),  which 
is  more  difficult  to  measure  in  the  same  units  as  that  of  the  smoothing  and  interpolation 
operations.  Fortunately,  the  complexity  for  this  ease  is  non*optimsi  and  thus  not  recommended 
for  use  in  practice  and  therefore,  for  our  purpoee,  it  is  not  neeessary  to  determine  this  constant. 

Based  on  the  results  in  Theorem  3  and  the  specific  forms  of  the  function  g(n)  in  Table  4*1,  we 
can  compute  the  time  complexity  of  Algorithm  BASICMG  for  various  combinations  of  c,  a  and  p, 
some  of  which  are  summarised  in  Table  4*2,  where  we  tabulated  the  highest  order  terms  of 
T(n)/0. 

The  classical  one  processor  (*y  ■■  0)  optimal  time  complexity  results  [1,  5j  are  contained  in 
these  tables.  For  example,  in  two  dimensions  (d  “  2)  g(a)  «  n2  and  Table  4*2  shows  that,  for 
the  refinement  parameter  a  —  2,  c  <4  gives  an  optimal  algorithm  (0(n2))  whereas  c  >  4  is  non* 


Table  4*2:  Time  Complexity  T(n)/$  of  Algorithm  BAS1CMG 


— — - - -♦ 

I  p  =  d  ~  7  I 

lei  3  I  2  I  1  I  0  I 

— — -- — 

I  1  I  8/7  n3  I  4/3  n2  I  2  n  *  logjB  I 
— —-——***********—— — — -♦ 

=2  121  8/6  n3  I  4/2  n2  •  BlogjB  I  0(b)  I 

♦  — 

I  3  I  8/5  fl3  I  4  b2  *  0(n,o*23)  |0(nlo*23)  | 

♦  -*****<*****-—  —  ------- 

I  4  I  8/4  B3  *  fl2log2B  I  0(b2)  I  0(fl2)  I 

I  p  =  d  -  7  I 

lei  3  I  2  I  1  I  0  I 

I  1  I  27/28  b3  |  9/8  b2  I  3/2  b  •  log3n  I 

*3  121  27/25  b3  |  9/7  b2  |  3b  •  OCb'0**2)  I 

I  3  I  27/24  b3  |  9/6  b2  *  nlogsn  I  0(b)  I 

♦  — 

I  4  |  27/28  fl3  I  9/5  b2  •  Q(n,0*»4)  I0(n'°«34)  I 

I  p  s  d  -  7  I 

lei  3  I  2  I  1  I  0  I 

- * - -♦ 

I  1  I  64/63  b3  116/15  n2  I  4/3  b  •  log4B  I 

S  4  121  84/82  b3  116/14  b2  I  4/2  b  *  0(n,o*«2)  | 

I  3  I  64/61  b3  118/13  B2  I  4  b  •  0(n'°*«3)  | 
♦  - — ♦ 

I  4  I  84/80  a3  118/12  b2  •  bIo§4b  I  0(b)  I 

♦- — - ------ - 

a  :  aash  rafinaaant  ratio 
e  :  nuabar  of  correetioe  eye  fas  in  8ASICNG 
7  :  n"1  proeassors  ob  fcha  n-grid 
d  :  di sans  ion  of  the  doaain 
Asyaptotie  Sfficianey  Boundary  (Saa  Theoraa  5) 
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optimal.  More  generally,  we  have  an  optimal  scheme  if  and  only  if  c  <  ad.  For  example,  the 
larger  a  is  or  the  larger  d  is,  the  larger  the  value  e  can  take  for  the  algorithm  to  remain  optimal. 
However,  with  a  larger  value  of  a,  more  relaxation  sweeps  are  needed  and  a  larger  value  of  c  has 
to  be  taken  in  order  to  achieve  the  same  accuracy.  We  note  that  the  constant  in  the  highest 
order  term  of  T(n)  does  not  vary  a  great  deal  with  either  c  or  a  (they  are  all  about  one,  especially 
for  the  larger  values  of  a).  This  suggests  that  the  best  balance  between  speed  and  accuracy  can 
be  achieved  by  choosing  the  largest  value  of  e  (or  close  to  it)  such  that  the  algorithm  remains 
optimal.  When  d-7  *■*  2  and  a  -■  2,  this  means  taking  c  to  be  2  or  3. 

4.2.  Efficiency,  Speedup,  Accuracy,  and  Optimal  Design 
Next,  we  are  going  to  look  at  the  effects  of  the  new  architectures  on  the  performance  of  the 
BASICMG  algorithm.  There  are  four  parameters  in  this  study:  c,  a,  7  and  d.  We  shall  call  a 
particular  combination  of  these  four  parameter  a  design.  We  shall  use  the  notation  T(c,a,7,d)  to 
denote  the  corresponding  complexity  of  the  design.  One  of  the  main  issues  that  we  would  like  to 
address  is  the  efficiency  E  and  speedup  S  of  a  particular  design,  which  are  defined  as  [8]: 

Definition  4: 

S(e,a,7,d)  —  T(c,a,0,d)  /  T(e,a,7,d) , 

E(c,a,7,d)  -  T(cfa,0,d)  /  (  P(7)  T(c,a,7,d) ) . 

The  speedup  S  measures  the  gain  in  speed  over  the  one  processor  architecture  while  the  efficiency 
E  reflects  the  tradeoff  between  processors  and  time  and  measures  the  efficiency  with  which  the 
architecture  exploits  the  extra  processors  to  achieve  the  speedup.  The  optimal  efficiency  is  unity, 
in  which  case  a  P-fold  increase  in  the  number  of  processors  reduces  the  time  complexity  P-fold. 
In  general,  the  efficiency  E  and  the  speedup  S  are  functions  of  n.  We  shall  call  a  design 
asymptotically  efficient  if  E  tends  to  a  constant  as  n  tends  to  infinity  and  asymptotically 
inefficient  if  it  tends  to  zero.  We  shall  primarily  be  concerned  with  analysing  the  efficiency  E  of 
a  design  in  this  section.  The  speedup  S  can  be  easily  read  from  Table  4-2. 

For  determining  the  asymptotic  efficiency  of  a  design,  it  suffices  to  determine  the  highest  order 
term  of  E.  The  efficiency  E  can  be  derived  from  the  explicit  expressions  for  T  and  P  in  a 
straightforward  manner.  Since  the  efficiency  for  1  ■  0  is  unity  by  definition,  we  shall  only  be 
interested  in  7  >  0.  We  summarize  the  results  in  the  following  theorem. 


load  U|j  la.  for  k  «  3,4 . q .  At  tuna  4.  this  special  functloa  It.  required  at  coll 

(2.2)  again: 


0 


This  rotation  follows  the  earlier  one  down  the  row,  removing  elements  Uj*  and 
loading  zeros  In  general,  cell  (k.Jt)  performs  the  special  function  k  times,  at 

,t  -  Zk  - 1 . 3k  -  2.  Then  k  identity  rotations  flow  down  row  k,  pushing  out  rows 

k,k  - 1 . 1  of  V .  and  finally  loading  the  2ero  that  precedes  the  next  matrix. 

To  make  the  array  output  uniform,  we  would  add  some  cells  at  the  lower  left 
to  make  the  array  a  rectangle: 


The  only  purpose  of  these  diamond-shaped  cells  is  to  delay  Anal  output  of  ele* 
rr.ents  of  U.  which  now  leave  the  array  in  the  same  format  as'eiements  of  A  — 
element  i,J  leave*  at  relative  time  m  -i  j  . 

3.  The  Backsolve  Array 

To  solve  the  triangular  system  UrY  -  B  (  Y  and  B  are  m.xn)  we  can  use  the  tri¬ 
angular  array  shown  in  Figure  6.  The  details  of  this  array  are  straightforward 
and  are  omitted.  We  note  that  It  consists  of  a  triangular  array  of  cells  each  con¬ 
taining  a  single  element  of  V  exactly  as  does  ths  CK  array,  so  that  it  might  In 
some  applications  be  useful  to  build  a  common  realization  of  both  these  arrays 

In  the  present  context,  n  is  often  large.  We  intend  to  solve  two  systems, 
UTY  s  B,  then  UX  *  Y\  we  are  not  otherwise  interested  in  elements  of  Y.  If  AT  is 
stored  we  can  store  Y  in  its  place.  Suppose,  however,  that  X  will  not  be  stored. 
We  may  want  to  minimize  temporary  storage  for  Y.  This  can  be  reduced  to  0 
(m*)  locations  in  two  ways.  We  could  use  a  second  array  to  solve  UX  -  Y  and 
stream  the  first  array's  output  into  this  second  array.  An  interface  of  3m(m  - 
1)  /2  delay  cells  is  needed,  as  Figure  7  shows.  There  is  another  possibility. 
One  array  can  solve  both  systems  at  the  same  time.  Figure  8  shows  two  succes¬ 
sive  cycles  of  such  a  device.  At  a  given  instant,  every  second  diagonal  is  working 


•o  the  two  matrices  are-separated  by  a  line  of  zeros.  The  scheme  will,  in  effect, 
push  out  V  as  it  is  created  and  fill  each  cell  with  a  zero  just  before  a  B  -  element 
reaches  it.  It  therefore  "looks"  to  the  new  matrix  B  as  if  the  array  initially  con¬ 
tained  only  zeros. 

When  a  B  -  element  first  arrives  at  a  boundary  cell  it  meets  a  zero  (we  shall 
later  show  this).  The  boundary  cell's  normal  function  (see  Figure  2)  is  to  store 
this  element's  absolute  value  in  its  memory  and  output  the  identity  rotation,  if 
the  element  was  non-negative,  or  - 1  times  the  identity  rotation  otherwise.  As 
this  rotation  moves  to  the  right  it  meets  cells  containing  zeros  In  their 
memories  and  pushes  these  zeros  out.  loading  instead  elements  (possibly 
negated)  of  row  n  of  B.  Thus  the  zeros  continue  to  lead  the  columns  of  B  down 
through  the  array. 

We  now  show  how  elements  of  U  are  unloaded.  Let  time  t  -  O  be  the  time 
a„  enters  cell  (1.  1).  Then  cell  (i.;)  accepts  its  last  A  -  element,  and  computes 
Its  U  -  element,  thereby  finishing  its  work,  at  time  t  *  t  «*■  j  -  2.  By  our  assumed 
sequence  of  inputs  (4)  the  datum  zero  appears  at  the  input  to  cell  (l ./)  at  time 
j .  immediately  after  it  has  computed  .  To  make  the  scheme  work,  we  want 
an  identity  rotation  to  get  there  at  the  same  time,  knocking  out  the  computed 
element  u,j  and  loading  the  zero.  This  will  be  made  to  happen  by  a  special 
boundary  cell  function. 

Let  cell  (1.1)  do  this  at  time  1: 


0 


U11 


The  rotation  to  generated  will  reach  cell  (1  J)  et  time  j,  ee  required.  Now.  et 
Uint  a.  let  oeU  (2,e)  do  in*  iuri  tfting: 


Ui2 


(The  datum  ulX  has  been  forced  out  of  the  first  row,  as  described  above).  This 
rotation  will  move  to  the  right  and  knock  elements  utJt  out  of  cells  ( Z.k )  and 


1)  /  Z,  of  the  array.  What  is  the  best  shape  possible?  Two  conflicting  factors 
influence  the  decision.  The  1/0  bandwidth  to  support  the  array  is  least  for  a 
"well-rounded"  array  (p**g)  since  only  the  cells  at  the  array  edge  communicate 
with  the  surrounding  systems.  But  the  number  of  passes  needed  to  solve  the 

!lven  problem  will  usually  not  be  minimized  by  taking  p  *  7.  In  typical  cases 
say  m  =  100,  pg  -p(p  - 1)  /2  a  31  )  the  minimizing  shape  may  be  quite  narrow 
p  *  2.  q  a  16  in  this  example). 

2.3.  Unloading  the  Cell  Memories 

The  QU  trapezoid  implements  the  matrix  factorization  (3).  But  how  can  we 
remove  the  elements  of  C/n  and  U 12.  which  are  stored  in  the  cells  of  the  array? 
Here  we  shall  develop  a  scheme  with  these  properties: 

1}  The  outward  flow  of  data  is  entirely  uniform. 

2)  Control  signals  are  applied  only  at  the  boundary  cells. 

3)  No  specially  controlled  functions  are  required  of  the  internal  cells. 

4)  The  array  can  finish  the  factorization  of  a  matrix,  unload  its  cell  memories, 
and  begin  the  factorization  of  another  matrix  with  no. delay  whatever. 

The  key  to  the  unloading  scheme  is  the  way  an  internal  cell  behaves  when 
given  the  "identity"  rotation  (c  *  1.  s  *  0  ).  It  acts  as  a  umt-delay: 


x 

± 

y 


To  begin,  suppose  a  new  matrix  B  follows  the  input  matrix  A.  The  data  will 
be  presented  to  the  array  in  this  format, 
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br* 

• 

• 

hm 

b„l 

0 

• 

0 

-  a 

12 

hi 

a22 

• 

• 

• 

a 

cm 

stored  in  the  bank  above  the  cell  it  enters  first.  A  column  (of  temporary  results) 
that  emerges  from  the  bottom  is  sent  to  the  bank  above  the  cell  it  will  next 
enter.  When  the  passes  are  sequenced  as  in  Figure  4,  this  destination  is  for 
some  columns  uniquely  determined,  and  for  others  is  one  of  two  possibilities. 
Thus,  the  extra*array  interconnections  are  very  simple. 

Control  of  the  memories  is  also  simple,  since  the  pattern  of  access  to  the 
data,  (Figure  7)  is  so  regular.  Memory  addresses  could  be  generated  once  and 
passed  from  one  bank  to  the  next. 

2.2.1.  An  alternate  scheme 

There  is  another  possibility.  We  could  also  work  with  tilei  of  these  shapes, 


The  array  implements  tile  11  by  using  all  internal  cells  actively,  tile  1  in  the 
same  way,  but  with  columns  brought  in  reverse  order,  and  tile  III  by  shutting  of! 
certain  internal  cells.  Tile  111  can  be  realized  only  if  it  fits  into  the  array  -  if 
2(p  -1)  «  ff  * P  ■ 

Comparing  the  two  possibilities  we  see  that  the  first  has  an  average  tile  aize 
of  p[q  -p)  whue  the  second  has  2 p(g  - 1)  /3  (the  ules  are  used  to  cover  rectan¬ 
gles  like  this.) 


Thus,  the  first  scheme  is  more  efficient  ifp(g  -p)  >  2p(g  - 1)  /3.  i.e.  If  g  >  3p  • 
2  .  Since  the  second  scheme  requires  that  q  be  at  least  3p  -  2.  it  can  never  be 
more  efficient  than  the  first. 

2.2.2.  Choosing  the  Array  Shape 

We  suppose  that  some  constraint,  cost  for  example,  Limits  the  size,  t»g  -p(p  - 


compute  the  factorization 


Q'B 


UXi  V* 

.  0 


(3) 


where  0  denotes  the  zero  matrix  order  (n-p)xp,  £/,,  is  pxp  upper  triangular, 
and  Uxt  and  Bt  are  full  (q-p)  -  column  matrices.  Bt  emerges  from  the  bot- 
tom  of  the  array;  UXI  and  U\z  are  stored  in  it. 

Can  we  obtain  a  complete  QU  factorization  with  the  trapezoid?  If  q  is  not 
lets  than  m.  the  number  of  columns  of  A  we  can;  we  would  zero  A  in  groups  of  p 
columns,  from  left  to  right,  using  fm  /p\  passes  through  the  array.  The  details 
are  obvious.  But  if  m  >  q  we  cannot.  For  suppose  we  zero  the  first  p  columns  of 

A  below  the  diagonal  by  passing  columns  1.2 . q  through  the  array.  We  want 

next  to  zero  columns  p  «■  1 . 2p.  by  passing  p  +  1.  i.  .  ,p.  ▼  q  through.  But 

until  we  have  applied  the  rotations  from  the  first  pass  to  columns 
q  ■*  1.  .  .  .  .  q  **•  p ,  we  may  not  apply  those  of  the  second  pass. 

To  allow  factorization  of  matrices  with  more  than  q  columns,  the  array  must 
provide  a  second  function:  the  ability  to  apply  previously  computed  (and  stored) 
rotations  to  a  set  of  input  columns.  We  can  give  the  array  this  ability  by  turning 
off  tne  cells  m  the  leftmost  pxp  triangular  part,  keeping  a  p*(q  *p)  rectangle 
active.  We  also  allow  rotation  parameters  to  come  in  wa  the  left  edge.  A  set  of 
g  -p  columns  can  enter  the  active  rectangle  at  the  top.  The  result  -  the  input 
rotations  applied  to  the  input  columns  —  emerges  from  the  bottom,  except  for 
the  first  p  rows,  which  are  stored  in  the  cells  of  the  active  rectangle. 

2.2.  Simulating  The  Full  Array 

Here  we  show  how  the  pxq  trapezoidal  array,  supported  by  an  appropriate 
memory  system  for  partial  results,  can  generate  a  QU  factorization  when 
m  >  q .  Imagine  that  the  set  of  work  to  be  done  is  represented  by  an  mxm  tri¬ 
angular  array  of  pairs, 


(i j)  :  1 


where  the  pair  (i.;)  represents  the  task  of  applying  to  column  j  the  rotations 
used  to  zero  elements  of  column  i.  The  array  can  be  ueed  to  perform  ’’gen¬ 
erate"  passes,  where  columns  are  actually  zeroed,  and  "apply"  passaa  where 
stored  rotations  are  applied.  A  generate  pest  performs  a  pxq  trapezoidal  piece 
of  the  aet  of  task  pairs;  an  apply  pass  performs  a  px(q-p)  rectangular  piece. 
Sequencing  the  passes  to  perform  the  entire  job  is  analogous  to  covering  a  trian¬ 
gle  by  trapezoidal  and  rectangular  "tiles"  following  thsse  rules: 


SRI)  Trapezoidal  tiles  must  be  placed  at  the  triangle  s  diagonal  edge; 

R2)  No  tile  may  be  placed  unless  the  diagonal  edge  to  its  left  has  been  tiled; 
R3)  No  tile  may  be  placed  if  any  space  directly  below  it  ls  untiled. 

There  are  many  legal  tilings;  Figure  4  shows  one. 

In  generating  the  QU  factorization  by  multiple  passes,  temporary  results 
are  produced.  These  must  be  stored  and  reentered  into  the  array  later.  Figure 
5  shows  a  suitable  memory  design.  The  important  feetures  are  these.  There  is  a 
separate,  independent  memory  bank  for  each  array  column.  A  matrix  column  is 


directions:  complex  matrices,  unloading  the  result,  U,  from  the  array,  control 
and  synchronization,  fabrication  of  the  cells,  and  simulation  of  a  large  array  by  a 
physically  smaller  array  through  decomposition  of  the  problem. 

QU  factorization  of  A  is  performed  by  finding  an  orthogonal  matrix,  Q *. 
such  that  Q'A  -  U  is  upper-triangular.  Q*  can  be  a  product  of  simple  orthogo¬ 
nal  matrices  (Givens  rotations)  each  chosen  to  zero  one  element -of  A  below  the 
diagonal.  To  zero  a^j  .  the  i**  and  (i  -  1)**  rows  of  A  are  replaced  by 


a»-ia 

c  s 

Oi-IM 

-s  c 

Ov* 

k  =  l.  2. 


m 


where  (c,s)  are  chosen  so  that  becomes  zero  and  the  matrix  shown  is 
orthogonal: 


c  = 


±=LL 


iJ 


s  s 

Vat-ij  «■  o 

The  elements  can  be  zeroed  column  by  column  from  the  bottom  up;  elements 
ere  zeroed  in  the  sequence 

(n.  1).  (n— 1,  1) . (2.  1);  (n.  2) . (2.  2);  •  •  •  ;  (n,  m) . (m  +  l.m) 

Note  that  the  rotation  zeroing  needs  to  be  applied  only  to  columns 

+ 1 . m  since,  at  the  time  it  is  applied,  the  elements  and  o»j,  for 

k  <  j  are  already  zero. 

The  Gentleman-Kung  (GK)  array,  an  mxm  triangular  array  of  two  cell  types 
that  computes  the  QU  factorization  (1)  of  an  nxm  input  matrix  A  ,  la  shown  in 
Figure  1.  Wo  oall  the  circles  "boundary"  coll*.  Figure  2  *how*  the  cell's  func¬ 
tions.  We  shall  now  explain  the  GK  array's  operation.  First,  note  that  the  cells 
each  have  a  single  memory.  Thaee  initially  are  sera.  The  diagonal  boundary 
oeUs  compute  rotation  parameters  (e.e).  Coll  {J .;)  computes  the  rotations  that 
Mro  elements  of  oolumn  y  of  A.  These  rotation*  then  move  right,  along  the  row* 
of  the  array.  The  square  "internal"  cells  apply  these  rotation*  to  the  other 
columns. 

The  matrix  A  enters  the  array  at  the  top  in  the  pattern  shown  in  Figure  l. 
To  the  upper  left  of  each  cell  is  the  time  that  the  first  element  of  A  arrives.  Sup¬ 
pose  On.,  enter*  cell  (1.1)  at  time  t  =  l.  The  first  element  of  U,  uM  ..is  com¬ 
puted  in  cell  (1.1)  at  time  t  *  n.  By  time  t  =  n  +  2(m  - 1)  the  last  element. 
Uip.m  .  has  been  computed.  U  now  resides  in  the  array.  The  rotations  defining 
Q  will  have  emerged  from  the  right  edge. 

2. 1.  The  Trapezoidal  Subarray 

We  would  like  to  solve  beamforming  problems  of  various  sizes  using  one  physical 
array,  so  we  must  consider  how  to  simulate  a  full  mxm  array  using  a  smaller 
piece.  Suppose  we  have  a  pxq  trapezoid,  as  shown  In  Figure  3.  Let  B  be  a 
matrix  having  n  rows  and  q  columns.  If  B  is  presented  at  the  array  top,  we 


wbere.c.  is  the  desired  signal  vector,  and  Rm  is  the  covariance  matrix  of  the  sig¬ 
nal  at  frequency  u : 

/?»(«)  *  r|x(w)x#(o)J 


In  practice,  for  every  interesting  frequency,  an  nxm  matrix  of  samples  of  the 
signal 


X'{u) 


would  be  obtained,  and  H  estimated  by 


•  R  *XX‘. 


(Here,  *  denotes  conjugate  transpose).  Possibly,  different  weights  would  be 
given  to  the  rows  of  X  \u). 

The  following  algorithm  gives  the  solution. 


!.  Factor 

X‘  s  Q  If 

where  Q  is  an  nxn  unitary  matrix,  and  if  is  the  nxm  matrix 


(1) 


2.  For  each  bearing-angle  d  , 


(a)  forward  solve: 

=  n(-d) 

(2a) 

(b)  back  solve: 

U  iu(d)  *  n(d)  (a  *a)  “l 

(2b) 

(Here  .a  *  son  a.  =  a*  U~l(U*)~l. c.  *  £.'R^l£.) 

For  the  remainder  of  this  paper  we  shall  concentrate  on  the  design  of  an 
adaptive  weight-selection  processor  that  performs  the  two  major  steps  of  the 
algorithm.  Two  systolic  arrays  will  be  used.  One,  a  variant  of  the  design  of  Gen¬ 
tlemen  and  Kung  for  QU  factorizations  [2],  performs  step  1.  The  second  does 
the  triangular  solves  of  step  2  and  is  new. 


2.  The  QU  -  factorization  Processor 

This  section  is  an  extension  of  previous  results  of  Gentlemen  and  Kung  in  several 
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1.  Introduction 


Several  recent  contributions  to  the  literature  in  signal  processing,  computer 
architecture,  and  VLSI  design  showed  that  systolic  arrays  are  extremely  useful 
for  designing  special  purpose  high-performance  devices  to  solve  problems  m 
numerical  linear  algebra  [1.2.3. 4. 5].  But  no  attention  has  been  paid  to  the  prob¬ 
lems  of  integrating  these  designs  into  any  computing  or  signal  processing 
environment.  The  purpose  of  this  paper  is  to  examine  systolic  arrays  in  a 
specific  context.  We  have  chosen  an  adaptive  beamforming  problem  as  that  con¬ 
text. 


The  adaptive  beamforming  problem  chosen  is  simple,  yet  typical  of  those 
encountered  in  sonar  signal  processing  applications.  The  signals  of  a  large  array 
of  m  identical  sensors  are  sampled,  stored  and  Fourier  transformed  in  time. 
The  result  is  a  set  z(u,i)  of  complex  values  depending  on  frequency  (u)  and  sen¬ 
sor  (t)  .  Then,  for  each  resulting  frequency  u  an  array  output  function  g  . 
depending  on  a  bearing-angle  d,  is  produced,  by 

C  («•*)  *  £  -(«.-)  &  (i.w.d) 

<  ■  i 


Here  overbar  denotes  complex  conjugate.  The  vector  iu  determines  the  charac- 
tarlsUea  of  the  baamforraer  For  the  minimum-variance  dlttortlonioaa  raaoonaa 


baamformer,  Ud  la  ehoaen  to  minimise  the  output  power,  the  expeoted  value 
|p  I1  ,  subject  to  a  eigne.'. -protection  constraint 


2  c(<,«.d)C3(i,w,d)  *  1 

<  a  1 


Here  c  (t.u.d)  is  the  output  of  seosor  i  at  frequency  u  given  no  signal  other  than 
that  coming  from  a  source  at  bearing-angle  iS  . 

The  solution  is  to  chooss  the  weight  vector 

su(u.d)  «  (tw(l.u.d) . ti/(m.u.d))r 


as 


•  Permanent  eddraea:  Department  of  Computer  Science,  Stanford  University.  Stanford,  CA  W06 


References 


[1]  R.  E.  Bank  and  T.  Dupont. 

An  optimal  order  process  for  solving  elliptic  finite  element  equations. 

Math.  Comp.  36:35*51,  1981. 

[2]  A.  Brandt. 

Multi-level  adaptive  solutions  to  boundary-value  problems. 

Math.  Comp.  31:333-390,  1977. 

[3]  A.  Brandt. 

Multi-Grid  Solvers  on  Parallel  Computers. 

Technical  Report  80-23,  ICASE,  NASA  Langley  Research  Center,  Hampton,  VA,  1980. 

[4]  P.  G.  Ciarlet. 

The  Finite  Element  Method  for  Elliptic  Problems. 

North-Holland,  Amsterdam,  1978. 

[5]  C.  Douglas. 

Multi-Grid  Algorithms  for  Elliptic  Boundary-Value  Problems. 

PhD  thesis,  Yale  University,  1982. 

Computer  Science  Department  Technical  Report  223. 

[6]  D.  Gannon  and  J.  van  Rosendale. 

Highly  Parallel  Multi-Grid  Solvers  for  Elliptic  PDEs:  An  Experimental  Analysis. 
Technical  Report  82-36,  ICASE,  NASA  Langley  Research  Center,  Hampton,  VA,  1982. 

[7]  W.  Hackbusch. 

A  Multi-Grid  Method  Applied  to  a  Boundary  Value  Problem  with  Variable  Coefficients 
in  a  Rectangle. 

Technical  Report  77-17,  Mathematisches  Institut,  Universitat  zu  Koln,  Cologne,  1977. 

[8]  David  J.  Kuck. 

The  Structure  of  Computers  and  Computations. 

Johm  Wiley  St  Sons,  New  York,  1978. 

[9]  H.T.  Kung. 

Why  Systolic  Architectures? 

IEEE  Computer  15(l):37-46,  January,  1982. 


17 


unity  for  the  reluct  of  a  end  p  that  occur.  The  r  part  of  the  constant  applies  equally  to  alt 
entries  of  Table  4-2  and  reflects  the  number  of  times  BASICMG  is  called  by  FULLMG.  We  point 
out  that  the  choice  of  r  that  results  in  truncation  error  level  accuracy  depends  on  how  efficiently 
BASICMG  reduces  its  initial  error  but  can  be  chosen  independent  of  n  [5].  Thus,  the  extra 
multiplicative  factor  does  not  affect  the  asymptotic  efficiency  of  a  particular  entry  in  Table  4-2. 
The  complexity  of  F(n)  in  the  last  case  is  actually  increased  by  a  factor  of  Iogmn  over  that  of 
T(n).  However,  the  corresponding  entries  in  Table  4-2  are  already  asymptotically  inefficient  and 
thus  this  extra  factor  again  does  not  affect  the  asymptotic  efficiency  of  the  design.  It  follows 
that  the  discussions  concerning  the  efficiency,  speedup  and  optimal  design  for  the  BASICMG 
algorithm  in  Section  4.1  are  also  valid  for  the  FULLMG  algorithm,  with  the  exception  that  a 
fully  parallel  logarithmically  asymtotically  efficient  design  is  slower  by  the  factor  log^n. 

6.  Conclusion 

We  have  proposed  an  architecture  based  on  a  system  of  processor- grids  for  parallel  execution  of 
multi-grid  methods  based  on  a  system  of  point-grids.  We  have  analyzed  its  efficiency  and  shown 
that  a  combination  of  algorithm  and  machine  is  asymptotically  efficient  if  and  only  if  c  <  ad‘7, 
where 

•  c  is  the  number  of  coarse  grid  iterations  per  floe  grid  iterations, 

•  a  is  the  mesh  refinement  factor, 

•  d  is  the  dimension  of  the  point-grids, 

•  7  is  the  dimension  of  the  processor- grids. 

We  find  therefore  that  fully  parallel  designs  —  with  7  —  d  —  cannot  be  asymptotically 
efficient.  There  is,  however,  only  a  logarithmic  fall-off  in  efficiency  when  c  —  a*1*7,  and  for  fully 
parallel  designs  this  occurs  for  c  —  1. 
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4.3.  Complexity  of  FULLMG 

Id  this  section,  we  shall  derive  the  complexity  of  the  FULLMG  Algorithm.  Since  FULLMG 
I  calls  BAS1CMG,  the  results  here  depend  crucially  on  the  complexity  of  Algorithm  BASICMG. 

Let  F(n)  denote  the  time  taken  by  one  call  to  FULLMG.  By  inspecting  the  algorithm  in  Table 
2*2,  it  can  easily  be  verified  that  F(n)  satisfies  the  following  recurrence: 

F(an)  —  F(n)  +  r  T(an) ,  ( 10) 

where  we  have  absorbed  the  cost  of  the  interpolation  step  in  FULLMG  into  the  interpolation 
costs  of  BASICMG  (i.e.  the  term  s  in  Equation  (7)).  Note  that  this  is  just  a  special  case  of  the 
recurrence  (4),  with  c  «■  1  and  W(an)  r  T(an).  By  inspecting  the  entries  in  Table  4-2,  we  see 
that  the  forcing  function  W(n)  in  this  ease  takes  the  form  of  either  np  or  nplogkn.  We  hare  the 
following  result  for  particular  solutions  of  (10)  for  this  class  of  forcing  functions,  which  can  be 
verified  by  direct  substitution. 

Lemma  6: 

(1)  If  T(n)  «■  anp  then  a  particular  solution  of  (10)  is: 

Fp(n)  —  (ap/(ap-l))  ronp  . 

(2)  If  T(n)  «  anplogkn,  with  p  >  0,  then  a  particular  solution  of  (10)  is: 

Fp(n)  *■  (ap/(ap-l))  ranplog&n  -  (ap/(ap-l)2)  ronp  . 

(3)  If  T(n)  *  alog^n  then  a  particular  solution  of  (10)  is: 

Fp(n)  —  (ra/2)  (  logjn  +  log^n  ) . 

Since  the  homogeneous  solution  of  (10)  is  a  constant,  the  general  solution  is  dominated  by  the 
particular  solutions.  We  give  the  highest  order  terms  of  F(n)  in  terms  of  T(n)  in  the  following 
theorem. 

Theorem  7: 

(1)  If  T(n)  —  onp  then  F(n)  —  (ap/(aM))  rT(n)  +  0(1). 

(2)  If  T(n)  —  onplogfcn,  with  p  >  0,  then  F(n)  (ap/(ap-l))  rT(n)  +  0(np). 

(3)  If  T(n)  —  olog^n  then  F(n)  —  (r/2)  Iogkn  T(n)  +  0(logftn). 

In  the  first  two  cases,  the  complexity  of  F(n)  is  the  same  as  that  of  T(n),  except  for  the 
e onttant  multiplicative  factor  (ap/(ap-l))r.  The  (ap/(ap-l))  part  of  this  constant  is  very  close  to 
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Next  we  shall  fix  a  and  7  and  rary  e  (columns  is  Table  4*2).  That  is,  we  fix  the  architecture 
aad  vary  the  multi-grid  algorithm.  This  time  the  speedup  factor  S  decreases  slightly  as  we 
increase  c  which  is  not  surprising  aa  we  are  doing  more  work  on  coarse  grids,  and  this  keeps 
many  processors  idle.  Again,  the  efficiency  E  decreases  as  we  increase  c,  and  after  a  certain 
entry,  the  algorithm  becomes  asymptotically  inefficient.  For  example,  take  the  two  dimensional 
case  with  n  processors  on  the  n-grid  (d  •  2,  7  «  1)  and  a  «  3.  Going  down  the  appropriate 
column,  we  hare  E(l,3,l,2)  ™  «  1/2,  E(2,3,l,2)  —  2/7  and  E(3,3,l,2)  «■  l/log^n.  Recall  that 
the  larger  c  is,  the  more  accurate  is  the  computed  solution  and  the  more  robust  is  the  orerall 
algorithm.  Thus,  the  c  —  2  design  is  the  most  accurate  efficient  design.  In  general,  for  fixed 
a  and  7,  the  design  just  above  the  efficiency  boundary  is  the  most  accurate  efficient  design.  If 
accuracy  is  no  problem,  then  c  ean  be  chosen  smaller  to  speed  up  the  algorithm. 

Finally,  we  fix  c  and  7  and  rary  a.  Generally,  a  larger  ralue  of  a  means  fewer  processors  are 
needed  to  implement  the  architecture.  It  also  means  that  less  work  has  to  be  done  on  the  coarse 
grids  because  they  hare  fewer  points.  To  see  the  effect  of  rarying  a,  note  that  the  efficiency 
boundary  mores  towards  the  lower  right  hand  corner  of  the  tables  in  Table  4-2  as  a  is  increased. 
This  implies  that,  for  a  fix  architecture  (7)  and  algorithm  (c),  using  a  larger  ralue  of  a  will 
generally  exploit  the  arailable  processors  more  efficiently.  Howerer,  one  cannot  indiscrimatorily 
use  large  ralues  for  a  because  this  leads  to  larger  interpolation  errors  and  less  accurate  solutions. 
For  example,  take  the  two  dimensional  ease  with  n- processors  on  the  n-grid  and  c  —  2.  With  a 
»  2,  the  algorithm  is  asymptotically  inefficient  (E(2,2,l,2)  «  l/log^n)  whereas  with  a  ■»  3  and  a 
■»  4,  it  is  asymptotically  efficient  (E(2,3,l,2)  —  2/7,  E(2,4,l,2)  —  3/7).  Thus,  the  a  -»  3  design 
is  the  most  accurate  efficient  design.  In  general,  for  fixed  c  and  7,  (he  smallest  value  of  a  that 
yields  an  efficient  design  is  the  most  accurate  efficient  design.  If  accuracy  is  no  problem,  then 
a  can  be  chosen  larger  to  speed  up  the  algorithm. 

One  can  carry  out  similar  parametric  studies,  for  example,  by  fixing  one  parameter  and  rarying 
the  other  two.  For  instance,  if  we  are  free  to  choose  both  the  architecture  (7)  and  the  algorithm 
(c),  then  by  inspecting  the  form  of  the  efficiency  constraint  e  <  a*"*,  it  can  be  seen  that  smaller 
ralues  of  c  allow  more  processors  to  be  used  (larger  7)  to  produce  a  faster  efficient  design.  For 
example,  in  the  a  »  2  case,  if  c  «■  2  then  the  p  «  2  entry  gires  the  fastest  efficient  design 
whereas  if  c  *  1,  it  becomes  the  p  «■  1  entry,  with  the  latter  being  faster.  Similarly,  for  a  fixed 
c,  a  larger  ralue  of  a  allows  more  processors  to  be  used  to  achiere  a  faster  efficient  design. 


Table  4-3:  Influence  of  Design  Parameters  on  Optimality  Conditions 

Design  Psra meters 
♦  - ------ - ♦ 

I  e  I  a  I  I 

Optimality  i  Ha*  Accuracy  I  large  I  smell  I  indep  I 

Conditions  ♦ - - — - - - - - - ♦ 

I  Min  T(n)  I  small  I  large  I  large  I 

Constraint  I  Efficiency  E  I  small  I  large  I  small  I 

In  Table  4-3,  we  indicate  the  influence  of  each  of  the  three  design  parameters  {c,a,7}  on  the 
optimality  conditions  and  the  constraint.  For  example,  the  accuracy  of  the  solution  is 
independent  of  7  and  to  achieve  maximum  accuracy,  one  should  take  c  large  and  a  small.  The 
appropriate  choice  of  optimality  condition  depends  on  the  requirements  of  the  given  problem. 
Moreover,  the  general  optimal  design  problem  may  not  have  a  unique  or  bounded  solution  in  the 
three  parameter  space  {c,a,7}.  In  practice,  however,  we  usually  do  not  have  the  freedom  to 
choose  all  three  parameters.  If  the  number  of  free  parameters  are  restricted,  then  the  optimal 
design  problem  may  have  a  unique  solution. 

We  shall  illustrate  this  by  fixing  two  of  the  three  parameters  in  turn  and  study  E  as  a  function 
of  the  free  parameter.  First,  let  us  fix  c  and  a  and  consider  the  effect  of  varying  7.  In  other 
words,  we  consider  the  case  where  the  multi-grid  algorithm  and  the  refinement  of  the  domain  are 
fixed  and  we  are  free  to  choose  the  architecture.  Varying  7  corresponds  to  moving  across  a 
particular  row  of  Table  4-2.  It  is  easy  to  see  that  one  achieves  a  speedup  as  we  use  more 
processors  (i.e.  as  one  moves  from  left  to  right  in  one  of  these  rows).  However,  the  efficiency  E 
generally  goes  down  as  one  uses  more  processors,  and  after  a  certain  entry  the  design  starts  to  be 
asymptotically  inefficient.  For  example,  take  the  three  dimensional  case  (d  <■>  3),  with  a  «  2 
and  c  ■■  2.  With  n  processors  on  the  n-grid,  the  efficiency  is  E(  2,2,1 ,3)  ■■  1/3.  With  n2 
processors  on  n-grid,  we  have  E(2,2,2.3)  *  1/log^n.  and  with  n3  processors  E(2.2,3,3)  «  0(l/a). 
Both  the  last  two  designs  are  asymptotically  inefficient  and  thus  the  7  «  1  design  is  the  fastest 
efficient  design.  In  general,  for  fixed  c  and  a,  the  deeign  jutt  to  the  left  of  the  efficiency 
boundary  to  the  faoteot  efficient  design. 
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Theorem  5:  Assume  7  >  0. 


(1)  If  c  <  ad‘7  then  E(e,a,7,d)  -  (a7-l)(ad-7-c)/(ad-c). 


(2)  If  c  —  ad'7  then  E(c,a,7,d)  —  (a^lja***7  /  ((; 

sd-c)logan). 

(3)  If  c  >  ad*7then 

0<l/(nloV-d+7)) 

/ 

if  c  <  ad, 

E(c,a,7,d)  —  J  0(log#n  /  n7  ) 

if  c  -  ad, 

OU/n7) 

if  c  >  ad. 

Based  on  the  above  theorem,  we  can  immediately  make  the  following  observations: 

1.  A  design  is  asymptotically  efficient  if  and  only  if  c  <  ad'7.  This  inequality 
defines  an  efficiency  boundary  in  the  four  parameter  space  of  {c,a,7,d},  the 
projections  of  which  are  shown  by  *’s  in  Table  4-2. 

2.  The  fully  parallel  design  (1  >■>  d)  is  always  asymptotically  inefficient.  This  follows 
because  to  have  an  efficient  design  in  this  case  requires  c  <  1  which  is  meaningless 
for  the  multi-grid  algorithm. 

3.  Define  a  logarithmically  asympototicaily  efficient  design  to  be  one  with  E  ™ 
0(l/logtn)  as  n  tends  to  infinity.  A  fully  parallel  design  is  logarithmically 
asymptotically  efficient  if  and  only  if  e  «  1.  This  is  case  (2)  in  Theorem  5. 

4.  If  we  start  with  a  non-optimal  design  in  the  one  processor  ease,  then  adding  more 
processors  will  net  make  the  design  asymptotically  efficient.  This  corresponds  to 
the  last  two  cases  in  Case  (3)  of  Theorem  5.  The  reason  is  that  too  many  coarse 
grid  correction  cycles  are  performed  so  that  even  if  more  processors  are  added  to 
speed  np  the  setup  time  for  transferring  to  the  coaser  grids,  too  much  time  is  spent 
on  the  coaser  grids. 

Asymptotically  efficient  designs  are  theoretically  appealing.  They  indicate  that  the  extra 
processors  are  utilized  efficiently  to  achieve  the  speedup.  For  this  reason,  it  is  interesting  to 
consider  the  following  problem: 

Optimal  Design  Problemi 

For  a  given  problem  (i.e.  given  d),  find  the  design  that 

minimizes  T(n)  and/or  maximizes  the  accuracy  of  the  computed  solution 
subject  to  the  constraint  that  it  is  asymptotically  efficient. 


on  one  problem,  the  other  diag onals  on  the  second  problem.  An  array  of  (3m*  - 
2m)  /  4  delay  cells  is  needed;  about  half  as  many  as  in  the  2  *  array  design. 

3.1.  Computing  The  Scale  Factors 

The  computation  (2b)  requires  the  scale  factor  ,a*(d)ft(i>).  that  is  the  dot  pro* 
duct  of  the  solution  of  the  system  (2a)  with  Itself.  Since  the  solution  Vectors  a 
are  produced  one  element  per  cycle  by  successive  array  columns,  we  can  accu* 
muiate  these  dot  products  by  attaching  a  row  of  cells  at  the  bottom  of  the  array; 
see  Figure  6.  for  example. 

4.  Complex  QU  Factorization 

For  signal  processing  applications,  we  must  deal  with  complex  matrices  A.  Of 
course,  one  can  solve  a  complex  nxm  least  squares  problem  by  means  of  a  real 
QU  factorization  of  the  2nx2m  matrix 

U*  -A/ 

•  [Aj  Ah 

where  A  =  Ah  +  iAj  is  the  decomposition  of  A  into  its  real  and  imaginary  parts. 
But  when  Givens  rotations  are  used,  this  factorization  requires  6/3  times  as 
many  real  multiplications  as  a  direct  complex  QU  factorization  of  A.  We  shall 
now  chscuss  how  this  can  be  done. 

The  QU  factorization  is  unique  only  up  to  scaling  of  the  rows  of  U  by  factors 
of  unit  modulus:  A  =  {QD)  (D~l  U)  is  also  a  QU  factorization  for  any  unitary  diag¬ 
onal  matrix  D.  Thus,  we  may  require  that  the  diagonal  elements  of  U  be  positive 
real.  This  is  the  (unique)  factorization  computed  by  our  array. 

Let  us  change  the  cell  definitions  of  Figure  1  to  those  of  Figure  9.  (We  shall 
now  employ  this  convention:  lower  case  Roman  letters  are  complex.  Greek  are 
real,  and,  for  example, 


«  *  a  ♦  ja 

where  j  *  V-T.  These  cells  are,  of  course,  more  difficult  to  implement  the 
real  Givens  cell*. 

With  this  there  is  little  else  that  changes.  Now,  when  a  leading  element  of 
an  input  matrix  hita  a  boundary  cell  the  effect  is  this 


a 


Thus,  instead  o ( the  identity  produced  in  the  real  case,  a  unitary  diagonal  rota¬ 
tion  that  simply  rescales  a  row  is  produced. 

6.  VLSI  Implementation 

In  this  section  we  shall  discuss  how  the  internal  cells  of  the  real  and  complex  QU 
arrays  and  the  complex  backsolve  array  might  be  fabricated  using  a  Systolic 
Internal  Chip  (SIC)  that  is  being  developed  at  ESL.  It  is  particularly  noteworthy 
that  these  different  cells  can  be  obtained  using  a  common  VLSI  building  block, 
without  the  use  of  additional  chips.  Moreover,  we  have  used  the  SIC  to  design 
other  compound  cells  (for  real  and  complex  LU  factorization  of  dense  and  band 
matrices  and  for  band  QU  factorization).  We  expect  that  systolic  arrays  for 
most  of  the  standard  algorithms  of  numerical  linear  /algebra  can  be  generated 
using  this  chip  and  one  other  that  we  will  also  mention./ 

First  we  shall  give  a  rough  description  of  the  SIC.  It  is.  being  designed  in  a 
TRW  2/u  CMOS  technology.  Its  f —notion  is  this 


U  LJ. 

out  =  in 

Y  V. 

out  *  on  * 

W  W  —  U  V 

out  «  in  (s)  in  in 


where@e  [+  .  -  .  t  .  +  -  j  denotes  the  sign  used  in  the  addition  This  can  be 
*  .  *.  or  can  alternate  between  the  two  Operands  are  floating  point.  Substan¬ 
tial  effort  hat  *on»  into  providing  awttohing  funetlona  and  internal  registers  to 
Increase  the  SIC's  flexibility. 

Compound  cells  for  complex  arithmetic  can  be  built.  Here  are  two  of 
Kung'e  designs:  a  complex  c  +  ab  cell  using  2  SICs, 


S  B 
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(  C  2  Y  +  j  y '  , 
a  2  a  ♦  j  a' , 
b  2  S  +  j  8'.  ) 


and  another,  using  4  chips,  in  which  one  operand  is  stationery. 


y' 

Y 


This  is  the  internal  cell  for  the  complex  backsolve  array. 

Figure  10  gives  a  layout  for  a  complex  Givens  cell  that  uses  6  SICs.  There 
are  two  two-chip  complex  multiply-add  cells,  one  for  c'z  and  another  for  cy  , 
and  two  one-chip  cells  for  real  •  complex  multiply-add,  one  for  cz  and  another 
for  try  ,  The  complex  quantities  are  represented  using  one  of  the  formats  dis¬ 
cussed  above.  New  operands  can  enter  the  cell  every  second  clock.  There  is  a 
three  clock  delay  on  the  z-t  path,  but  only  a  1  clock  delay  on  the  cw— and 
a**  -  paths. 

We  lack  the  space  to  fully  discuss  the  boundary  cell's  implementation. 
Another  chip  is  necessary.  A  chip  using  either  faster  gates  or  more  internal 
parallelism  could  provide  divide,  square  root,  and  reciprocal  square  root  opera¬ 
tions  at  a  rate  of  one  operation  per  SIC  cycle.  This  second  chip  could,  with  the 
51C.  be  used  to  design  a  pipelined  boundary  cell  for  QU .  backsoive,  or  LU  opera¬ 
tions  with  enough  throughput  to  keep  pace  with  the  array.  The  boundary  cells 
will  usually  have  more  latency  than  the  internal  cells. 

For  a  trapezoidal  QU  array  this  extra  boundary  cell  latency  throws  the 
array  out  of  synchronization  unless  we  provide  appropriate  tntercell  delays. 
Fortunately  these  delays  are  needed  only  in  the  left-hand  triangular  part  of  the 
array.  Figure  11  gives  an  example.  Here  the  boundary  cell's  latency  is  S  and 
the  internal  cell's  is  4,  although  its  left-right  latency  is  just  1.  The  time  an 
operand  first  arrives  is  shown  In  each  cell.  The  diamonds  are  delays;  they  delay 
the  data  4  cycles.  Their  effect  Is  to  equalize  the  latency  of  alternate  paths 
through  the  array.  A  pxq  array  requires  -  l)  /Z  delay  cells.  These  delays 
have  an  effect  on  the  latency  of  the  array  and  on  the  pattern  of  input  output 
(it  Is  not  a  parallelogram  any  more).  But  there  is  no  reduction  in  throughput. 
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1 .  Summary 

Every  symmetric  positive  definite  n*n  matrix  A  admits  both  the  Cholesky 
factorization 

(1.1)  A  -  RTR 

where  R  is  upper  triangular  and  the  related  "square-root-free"  factorization 

(1.2)  A  -  LDLT 


where  L  is  unit-lower  triangular  and  D  is  diagonal  with  positive  elements. 


In  a  number  of  situations,  it  is  necessary  to  factor  the  modified  matrix 

T 

(1.3)  A  =  A  ♦  azz  . 

Gill,  Golub,  Murray, and  Saunders  (hereafter  GGMS)  give  several  algorithms 
for  modifying  the  factors  of  A  [2).  The  algorithms  all  require  Cn2  +  i2(n) 
operations  for  some  constant  C;  direct  Cholesky  factorization  requires 
( 1  /6)n 3  + 'fXn2 )  operations,  where  an  operation  is  one  multiplication  and 
one  addition.  The  purpose  of  this  paper  is  to  describe  the  implementation 
of  some  of  these  algorithms  by  systolic  array. 


The  principle  application  for  systolic  computation  of  updated  Cholesky 
factors  comes  from  digital  signal  processing.  There,  A  is  an  estimate  of 
the  covariance  matrix  of  a  random  signal  vector.  Periodically  it  is 
updated  according  to  (1.3);  in  these  applications  a  is  ordinarily  positive. 


In  this  context  one  may  also  have  to  consider  rank  k  modifications 


(1.4) 


We  discuss  an  appropriate  systolic  method  for  this  situation  in  Section  3. 


1.1  Notation .  The  elements  of  a  matrix  A  and  a  vector  x  shall  be  denoted 
hy  a^j  and  x^.  The  symbols  R,  L  and  D  shall  be  used  for  upper  triangular, 
unit  lower  triangular  and  diagonal  matrices,  respectively. 

D  -  diag(di,  ...  .d^).  We  write  for  the  iC^  column  of  the  identity  matrix. 

We  shall  use  the  notation  P^J  for  a  plane  rotation  matrix  that  differs 
from  the  identity  matrix  only  in  that 


P  •  •  “  P  •  • 
“  JJ 


Pij 


sinQ.  Given  i,  j  and  x  there  exists  9  such  that 


where  c  ■  cosQ,  s 


T 


(piH- 


\k  .  k  ♦  i.j 

0  ,  k  -  i 

(x?  k  -  j 


2.  Methods  using  the  Cholesky  factorization  of  D  +  app 

Let  us  determine  the  factors 


A  *  LDL 


By  (1.2), 


_  j  T  T 

A  *  A+azz  =  L(D+app  )L 


where  Lp=z.  First,  find  p  by  back-substitution.  If  we  find  the  factorization 

(2.1) 


T  **  r**  ** r r 

D  +  app  “LDL 


Then  the  modified  factorization  of 


—  T  T 

A “ LLDL  L 


is  given  by 

(2.2)  L  “  L L,  5  =  D. 

We  consider  two  algorithms  of  GGMS  for  computing  (2.1)  -(2.2),  The  first 
is  readily  implemented  by  systolic  array.  It  may  fail,  however,  when  a<0 
and  A  is  ill-conditioned.  The  second  is  infallible,  but  not  easily 
implemented. 

GGMS  show  that,  in  (2.1), 


(2.3) 


8.  •  p  0  ,  1  <  s  <  r  <  n 

rs  r  s 


Thus,  to  get  the  factorization  (2.1),  we  need  only  find  8.,  d.,  1  < j  <n. 

,  3  —  3 

The  special  structure  of  L  also  allows  fast  computation  of  L*LL  .  Define 

the  vectors 


.  W^J)  '  J.  Sri  Pi  “  zr  •  Si  pi'  +  1 . n- 

i-j  i-l 


These  values  are  generated  in  the  course  of  back-substitution  to  find  p. 
When  computing  L  they  are  used  again.  By  (2.2),  (2.3) 


- 1- 


The  final  recurrence  for  computing  L  and  D  is  this: 
Algorithm  2. 1 

1 .  Let  ai  ■  a;  v/  1 ^  *  z. 

2.  for  j  «  1.2,  ...  , n ,  compute 


(2.4) 

(j) 

D  .  =  W  . 

J  J 

(2.5) 

d  .  =  d  .  ♦  a  .  p( 

J  J  J  J 

(2.6) 

8.  =  p.a./d. 

J  J  J  J 

(2.7) 

a.  ,  =  d.a./d. 

J+1  J  J  J 

(2.8) 

W(j  +  D  _  „(j)  _  p.£  . 
r  r  rj 

(2.9) 

-  ( i+1 ) 

rj  r j  j  r 

We  now  consider  systolic  array  methods.  First,  there  is  an  obvious  method, 
shown  in  Figure  1,  in  which  there  is  a  processor  for  each  stage  1  < j  <n 
in  the  algorithm.  The  cell  used  is  this 


first  clock  (r  =  j) : 


r 


(see  (2-4)  -  (2.7)); 


subsequent  clocks  (r>j): 


r 

6. 

J 


l  . 

rJ 


L 


t  . 

rj 

> 


6. 

J 


1 


-J 


l. 


(see  (2.8),  (2.9)). 


wt 

- 

a? 

->  -  »  ■ 

.  . - » 

j 

*  - 

T 

1  ♦ 

*  ^ 

(1) 

W1 

d, 

( 

1  1 

i  i 

v<’> 

L 

j 

(2) 

s2 

A 

21 

L. 

a2 

W(1) 

w3 

*31 

(2) 

w 

3 

*32 

w<3>  d, 

} 

w(1) 

n 

en1 

(2) 

w 

n 

*n  2 

• 

(n)  . 

.  .  w  d 

n  n 

Figure  1.  An  array  for  Cholesky  update 


The  principle  disadvantage  of  this  design  is  that  the  cells  are  relatively 
complicated.  All  must  divide.  All  have  memory. 

There  is  a  second  array,  shown  in  Figure  2,  without  these  disadvangages 


The  cells  are  defined  as  follows 


(see  (2.8)); 


d. 

J 


(see  (2.4),  (2.5),  (2.7)); 


-h- 


Algorichm  2.1  may  fail.  GUMS  point  out  that  when  a<U  and  A  is  nearly 
singular,  rounding  error  can  cause  one  of  the  computed  values  dj  to  be 
nonpositive,  so  that  an  indefinite  factorization  is  obtained. 

GGMS  have  proposed  a  modified  algorithm,  algorithm  2.2  below,  that 
is  guaranteed  to  provide  a  positive  definite  factorization. 

Algorithm  2.2 

1 .  Solve  LP  =  z 


2.  Define 


ai  =  a 


si 


l  p?/d. 
j-1  J  J 


ii 


oi  =  a/[  1  ♦  ( 1  +  asi) J] . 
3.  for  j  ■  1,2,  ...  ,n,  compute 
(a)  q. 


(b)  0 

(c)  s 

J 

(d) 

(e)  d 

(f)  8 

(g)  a 


■M 


“  p./d. 

J  J 

=  1  +  o .q  . 
J  J 

=  s  .  -  q  . 

J  J 


0:  +  a2.  q  .  s  . 

J  J  J  J+1 


p>  d. 

J  J 

a  .p  .  /d  . 
J  J  J 

J  J 


(h)  o. 

*  0.(1 

♦  p  .  )  / 1  P  .  (  6  ■  « 

J*' 

J 

J  J  J 

(i)  for 

r  *  j  +  1 , 

j*2,  ...  ,n. 

.  w^-p.l  . 

r 

r  *j  rj 

(2. 

,  1G) 

l  .  *  e. 

♦  3 

rJ 

r  j  J  r 

Thi 

ls  algori 

thm  doe 

s  not  permit 

an 

array,  a 

s  does 

Algorithm  2. 1 

step  3  can 

begin , 

and  the  full  1 

i  n 

order  to 

compute  si.  There  i 

in 

whii.h  p, 

w.  q. 

and  si  are  coi 

thi 

rough  the 

array , 

and  all  othe 

boundary  cell  computation  for  the  second  pass  is  quite  complex,  as 
it  encompasses  steps  3(b)  -3(h).  In  practice,  it  will  be  rather 
difficult  to  match  the  speed  of  such  a  boundary  cell  to  chat  of  the 
rest  of  the  array,  which  would  consist  of  a  linear  array,  each  cell 


performing  the  computation  (2.10)  for  a  fixed  value  of  r-j,  exactly 
as  in  the  lower  half  of  the  array  in  Figure  2. 


3.  Methods  based  on  plane  rotations 

GGMS  give  several  algorithms  using  plane  rotations  for  modifying 
Cholesky  factorizations.  We  have  found  two  of  the  more  efficient  of 
these  to  be  easily  implementable  as  systolic  arrays.  The  first  is 
simple,  efficient,  requires  only  one  pass  of  the  data  through  the 
array,  but  works  only  when  a >  0.  The  second  works  in  general,  but 
requires  two  passes.  Both  can  be  extended  to  handle  rank-k  updates 
(1.4). 


3.1  Positive  rank  -  1  changes 

We  compute  the  modified  factorization 

RTR  *  RTR  +  azzT,  a>0, 
i  T 

by  reducing  the  matrix  [a2z  s  R  ]  to  lower  triangular  form, 
[a*z  •  RT]p  -  [rT  :  o] 

where  P  =  P21  P32  •••  P  ?.  It  follows  that 

n+i 


rtr  -,v 


T 

az  z  , 


so  that  R  is  the  updated  Cholesky  factor. 

The  method  can  be  readily  extended  to  handle  rank-k  updating  efficiently. 
In  this  case 


(3.  1) 


A  *  A  +  MZ 


where 

Z  -  [zr  z2,  ...  ,zk] 
is  an  n*k  matrix  and 


diag(ar  o2>  ...  ,  c^), 


When  cu  >  0,  1  < j  < k  then  we  can  form  the  matrix 


A>  ZT 


and  reduce  it  to  upper  triangular  form  by  premultiplication  by  a  suitable 


sequence 


-8- 


p  .  p(«)  p^-0  <t<  p(D 


D(j)  .  p  J  P  p  k*j"' 

P  Pj>1  Pj+2  *•*  Vj 


of  plane  rotations. 

The  advantage  of  considering  k  updates  together  instead  of  as  a 
sequence  of  k  rank-1  updates  is  that  the  modification  of  R  can  be 
carried  out  by  one  pass  of  the  data  through  a  systolic  array  of  0(kn) 
processors.  Thus,  while  the  number  of  arithmetic  operations  is  the 
same,  the  total  time  and  the  number  of  I/O  operations  (including 
memory  references)  is  less  by  the  factor  k. 

We  postpone  consideration  of  the  array  until  the  next  section. 


3.2  Negative  rank-1  changes 

CGMS  have  given  a  modification  of  the  previous  method  for  the  case 
a<0.  We  shall  now  generalize  their  approach  to  the  case  of  a  rank-k 
change.  Let 

(3.2)  A  -  A  -  MZT  *  RTR-Z,52T 

where  Z  is  n*k  and  A  >  0  is  diagonal,  for  convenience  we  take  A-  I, 
without  loss  of  generality.  Now  let  P  be  the  n*k  solution  to 

(3.3)  RTP  -  Z. 

T 

It  is  easy  to  show  that  A  remains  positive  definite  if  and  only  if  I-P  P 
is  positive  definite.  Now  form  the  matrix 

L'.i.'i 

d  :  o  : 

L  n  .  j 

(D  is  a  lover  criangular  matrix  to  be  specified  below)  and  premultiply 
n  (f)(2)  (W) 

by  an  orthogonal  matrix  Q  of  the  form  Q  ■  Pv  Pv  ...  P'  where 

P^  *  P  .*  . . .  P  n  zeros  column  j  of  P  and  leaves  R  in  upper-triangular 
n+j  n+j  J  vv  ° 

form,  until 


r p  :  Rj 

(3.4)  q  i  — :  — I 

tDn:  oj 

From  (3.4)  we  conclude  that 


o  :  R  ; 


We  first  consider  the  method  of  section  3.1  for  positive  updates. 

The  array  presented  is  a  generalization  of  an  array  of  Heller  and 
Ipsen  for  QR  factorization  of  a  banded  matrix  [3].  A  similar  notion 
also  appears  in  Schreiber's  work  on  systolic  arrays  for  the  eigenvalues 
of  a  nonsytmnetr ic  matrix  [5]. 


The  array  is  shown  in  Figure  3.  It  is  k  *  (n+k) ,  rectangularly 
connected.  Not  all  of  the  k(n+k)  cells  actually  do  anything,  as  is 
shown  in  the  figure.  If  the  first  elements  of  R  and  Z  enter  at  time  1, 
then  the  first  element  of  R,  m,  emerges  at  time  2k+1,  and  the  last, 


r  at  time  2(n+k)  -  1 . 
nn 


Again,  for  convenience,  we  assume  a  el.  The  cells  of  this  array 
operate  as  follows. 


boundary  cell : 


s  ■  sin  9 ,  and 


(o')  -  (4 :)  d) . 


/V.  .v 


_  *,  *,  ■»  *»  *• ^  ,  ►  , '  *  *  .  »'*,***,«  *  •  1  *  * »  *  *  •  *  » *  ^  *  i 

•.  .%  . » -v  ^  »  -  - •’* 


y 

where  c  *  cos  9, 


internal  call; 


C  .8 


where 


t  9  «• 


L 


ftO  •  (4 :)  ft)  • 


delay  cell : 


r 


nn 


Figure  3.  The  Heller-Ipaen  array  for  a>0,  k*3 


lc  would  be  quite  possible  to  eliminate  the  delay  cells  and  provide 
input  data  directly  at  the  right-hand  edge  ot  the  remaining  k  *  n  array 
The  pattern  of  access  to  the  data  is  slightly  more  complex  when  this 
is  done. 

The  second  algorithm,  for  a<G,  requires  two  passes  over  the 

matrix  R.  First,  the  systems  (3.3)  are  solved  using  a  kxn  array, 

an  obvious  generalization  of  Rung  and  Leiserson's  array  for  k  =  1  and 

banded  R  [ 4 ] .  Next,  we  compute  D  off-line.  We  assume  that  k  is  quite 

n  T 

small,  so  that  it  is  easy  to  do  this.  The  matrix  I-P  P,  of  which 
is  the  Cholesky  factor,  can  be  accumulated  by  another,  rather  obvious, 
array  of  k(k+1)/2  cells. 

Now  we  consider  the  reduction  (3.4).  This  is  done  by  the  array 
shown  in  Figure  4. 

The  matrices  D^,  P,  and  R  enter  in  the  format  shown,  with  D  =  td^j]. 

The  last  element,  r.  .,  enters  2n  ♦  k  -  1  clocks  after  the  first,  and 

*  *  ■ 

leaves  from  the  top  k  clocks  later.  The  matrices  D0  =  I  and  S  reside  in 
the  array.  The  cells  used  are  these: 


boundary  cell: 


if  empty: 


where  c  =  cos  0,  s  =  sin  6,  and 

( o)  ■  c  (;) . 

(there  is  really  nothing  special  about  the  "empty"  case] 


internal  cell: 


x 
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c  ,S 


X 
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Figure  4,  Updating  with  plane  rotations  when  a<0 
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Initially  all  cells  contain  zero  and  all  rotations  are  c-0,  s-1. 

In  many  respects  this  array  is  like  a  k  *n  section  of  the  Centleraan- 
Kung  array  for  QR  factorization  [l],  which  was  described  more  fully  by 
Schreiber  and  Kuekes  [6]. 


4 .  Related  problems 

The  problem  of  updating  Cholesky  factors  frequently  arises  in  contexts 


in  which 


A  =  BTB 


for  a  rectangular  matrix  B  of  n  columns.  In  the  orthogonal  factorization 

B  =  QR 

T 

the  matrix  R  is  the  Cholesky  factor  of  A.  If  a  row  z  is  appended  to  B 
this  causes  a  rank-cne  change  to  A.  To  update  R,  any  of  the  methods 
discussed  here  can  be  used.  The  first  method  of  Section  3  is  particularly 
appropriate  since  it  uses  orthogonal  operations  to  update  R.  If  Q  is  to 
be  updated,  too,  one  computes 


-<  r  ■> 
Q  :  0  B 


o  :  i 


,  f  R  V51 

1  *T  ;  lo. 


so  that 


Q  -  Q, 


Q  :  0  ! 
.  0  :  1 J 


The  array  used  can  be  enlarged  to  allow  this  computation  to  be  performed 
also. 
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I.  INTRODUCTION 

In  this  paper  we  shall  summarize  the  recent  development 
of  systolic  array  methods  for  some  of  the  important  standard 
problems  in  numerical  linear  algebra.  We  shall  discuss  LU  and 
QR  factorizations,  eigenvalue  problems,  and  the  singular  value 
decomposition.  All  the  work  we  shall  describe  has  been  done 
since  1981.  Our  aim  is  to  introduce  the  reader  to  this 
rapidly  developing  branch  of  numerical  computation. 

Systolic  arrays  were  introduced  and  named  by  Rung  and 
Leiserson  [12]  .  Although  their  ideas  had  antecedents  (see 
[ 8 J  for  example),  they  first  fully  realized  and  proclaimed 
that  these  designs — highly  parallel  computing  networks  with 
regular  data  flows,  two-dimensional  lattice  form,  simple  iden¬ 
tical  cells,  regular  input  and  output  patterns,  and  heavy  re¬ 
use  of  data--were  an  ideal  way  to  use  the  emerging  VLSI 
technology  to  obtain  very  high  performance  for  suitable 
computations . 

After  these  first  promising  designs,  which  implemented  a 
few  relatively  simple  matrix  computations,  two  important  ques¬ 
tions  were  open.  Could  systolic  arrays  be  found  for  a  broader 
class  of  matrix  problems,  including  the  more  important  and 
difficult  matrix  decompositions?  And  could  they  be  integrated 
into  real  computing  systems  without  introducing  incapacitating 
losses  of  efficiency? 
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The  first  of  these  questions  has  now  been  answered  and 
the  answer  is  a  definite  "yes."  This  research  has  done  more 
than  answer  the  question  of  the  applicability  of  the  systolic 
array  rdea.  It  is,  rather,  the  beginning  of  an  intriguing  new 
approach  to  numerical  computation.  In  this  new  approach  the 
flow  of  data  in  both  time  and  space  is  important,  whereas  tra¬ 
ditionally  only  the  sequence  of  computations  in  time  was  con¬ 
sidered.  An  interesting  facet  is  that  the  standard  algorithms 
have  not  always  proven  suitable  for  systolic  array  realization. 
For  instance,  for  symmetric  eigenvalue  problems,  Brent  and  Luk 
1  [>]  have  found  that  Jacobi's  method  is  better  than  QR.  Their 
method  uses  a  permutation  of  the  off-diagonal  elements  that 
seems  to  be  better  than  the  usual  cyclic-by-rows  permutation 
[5]  . 

The  second  question  is  being  answered  now.  Several  pro¬ 
jects  for  building  VLSI  based  systolic  hardware  are  currently 
m  progress  or  complete  ([1],  [19]}.  An  effort  at  ESL,  Inc., 

should  produce  a  systolic  machine  with  performance  in  the  100- 
1000  megaflop  range  in  the  next  few  years. 

It  appears  at  the  moment  that  digital  signal  and  image 
processing  [15] ,  rather  than  standard  scientific  computing  and 
elliptic  equation  solving  in  particular,  offers  the  sort  of 
problem  best  suited  for  systolic  array  solution.  Nevertheless 
thuse  ideas  can  be  usefully  applied  to  elliptic  equations.  In 
Section  VI  we  shall  discuss  an  implementation  of  multigrid 
methods  by  highly  parallel  computing  networks. 

II.  LU  AND  QR  FACTORIZATIONS 

The  first  such  array  was  an  m  * m  hexagonally  connected 
array  for  LU  factorization  of  n  *  n  banded  matrices  with  band¬ 
width  m  in  time  0(3n)  [12].  Unfortunately,  there  was  no  pro¬ 

vision  for  pivoting  to  enhance  stability.  Next,  an  array  for 
QR  or  LU  factorization  of  a  dense  square  matrix,  in  linear 
time,  was  given  by  Bojanczyk,  Brent,  and  Kung  [ 2 ] .  A  differ¬ 
ent  array,  better  in  that  it  could  handle  rectangular  m  * n 
matrices  in  time  0(m),  was  given  by  Gentleman  and  Kung  [10]. 
The  unit  of  time  we  use  is  the  cycle  time  of  a  cell  of  the 
given  array.  In  this  time,  every  array  accepts  one  set  of 
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inputs  and  produces  one  set  of  outputs.  Finally,  Heller  and 
ipseii  developed  a  rectangular  array  for  band  matrix  LU  and  QR 
factorization  [11].  (See  Figures  1  and  2.)  Many  practical 
details  concerning  the  implementation  of  the  Gentleman-Kung 
array  were  discussed  by  Schreiber  and  Kuekes;  they  also  give 
an  array  for  triangular  systems  with  many  right-hand  sides 
’  •  5]  . 


Figure  1. 

The  Gentleman-Kung  Array 


Figure  2 . 

The  Heller-Ipsen  Array 


The  QR  factorization  arrays  employ  Givens  rather  than 
Householder  transformations.  The  LU  arrays  employ  a  stable 
variant  of  Gaussian  elimination  that  uses  "neighbor  pivoting." 
When  the  matrix  element  a^  is  eliminated,  the  elimination  is 
done  by  subtracting  from  row  i  a  multiple  of  row  i-1.  These 
rows  are  first  exchanged,  if  necessary,  to  make  the  multiply¬ 
ing  factor  smaller  in  modulus  than  one.  Partial  pivoting  does 
not  lend  iteself  to  systolic  implementation. 

III.  EIGENVALUE  COMPUTATION 

The  obvious  first  approach  was  to  attempt  to  implement 
the  standard  algorithm  for  the  symmetric  problem:  reduction 
to  tridiagonal  form  by  orthogonal  similarity  transformation 
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followed  by  a  method,  QR  iteration  for  example,  for  the  tri- 

2 

diagonal  matrix.  A  linear  array  that  requires  0(n  )  time  is 

easy  to  find,  but  more  parallelism  has  so  far  proved  elusive. 

Schreiber  showed,  however,  that  reduction  to  a  matrix  with  2k+l 

2 

non-zero  diagonals  could  be  done  in  n  /k  cycle  times  using  a 
k  xn  sub-array  of  the  Gentleman-Kung  array.  The  eigenvalues 
of  the  banded  matrix  can  be  found  using  QR  iteration  which  was 
shown  to  be  implementable  by  the  Heller- Ipsen  array  in  only  a 
little  more  time  than  is  needed  for  QR  factorization  [13] .  If 
further  speed  is  desired,  then  up  to  n/k  QR  iterations  can  be 

performed  simultaneously  by  a  pipeline  of  Heller-Ipsen  arrays. 

1/2  3/2 

Choosing  k  =  n  /  yields  a  method  using  0(n  )  processors  and 

3/2 

o(n  '  )  time.  (There  is  some  difficulty  concerning  the  choice 

of  shifts  in  the  QR  iteration  when  iterations  are  pipelined.) 
These  techniques  are  also  applicable  to  nonsymmetric  matrices 
[14]  . 

The  pursuit  of  a  linear-time  solution  led  Brent  and  Luk 
to  consider  Jacobi  methods.  They  have  found  an  implementation 
using  an  n/2  xn/2  array,  which  implements  a  cyclic  Jacobi 
sweep  in  n-1  cycle  times  [5].  Their  experiments  have  shown 
that  O(log  n)  sweeps  are  required  for  convergence.  In  prac¬ 
tice,  for  n  <_  1000,  no  more  than  10  sweeps  would  be  required 
(5]  . 


IV.  SINGULAR  VALUE  DECOMPOSITION  (SVD) 

To  both  the  approaches  of  Schreiber  and  Brent-Luk  there 
are  analogous  approaches  to  the  SVD.  Schreiber's  takes  mn/k 
cycle  times  to  reduce  an  m  x  n  matrix  to  an  upper  triangular 
matrix  with  k+1  nonzero  diagonals.  QR  iteration  can  then  be 
carried  out  using  Heller-Ipsen  arrays  [16],  Brent  and  Luk 
employ  a  linear  array,  and  their  method  requires  O(mnlogn) 
time  14],  Brent,  Luk,  and  Van  Loan  have  just  reported  an  im¬ 
proved  method  requiring  almost  linear  time  [6] . 

One  of  the  uses  of  SVD  is  in  solving  ill-conditioned 
least-squares  problems.  An  alternative  method  is  to  construct 
the  generalized  inverse  of  a  rank-deficient  matrix  closest  to 
the  given  matrix  by  an  iterative  method.  An  algorithm  of  Ben- 
Israel,  improved  by  Schreiber,  can  be  used.  The  generalized 


inverse  is  obtained  in  0(m  log  cond(A) )  time  using  n^  proces¬ 
sors  1 1  8  J  . 

V.  DECOMPOSING  ARRAYS 

A  most  important  question  from  the  practical  point  of 
view  is  this:  can  a  physical  array  of  fixed  size  (number  of 
cells)  be  used  to  solve  problems  of  arbitrary  size?  The  ques¬ 
tion  is  important  because,  obviously,  problems  of  widely  dif¬ 
fering  sizes  may  be  confronted  in  some  applications.  Not  so 
obviously,  in  some  applications  the  problems  may  have  only  one 
size,  but  they  may  occur  infrequently  enough  that  a  full-sized 
array  would  not  be  kept  busy  for  more  than  a  fraction  of  the 
time.  In  these  cases  it  would  be  more  efficient  to  build  a 
smaller  array  and  have  it  simulate  a  full-sized  array  of  the 
same  kind.  The  question  is  whether  this  hardware/time  trade¬ 
off  is  possible. 

For  some  arrays  it  is  easy  to  see  how  to  do  this.  Con¬ 
sider  the  Gentleman-Kung  array  (Figure  1) .  Suppose  a  sub¬ 
array  of  the  size  shown  within  the  dashed  outlines  is  available. 
I he  sub-array  could  be  used  to  perform  the  work  done  by  the 
outlined  part  of  the  whole  array  because  all  the  data  flowing 
into  that  part  of  the  array  is  known  at  the  outset.  Data  flow¬ 
ing  out  of  the  subarray  has  to  be  stored.  The  sub-array  can 
next  be  moved  to  cover  another  part  of  the  large  array  for 
which  all  inputs  are  now  known.  This  continues  until  the 
whole  array  has  been  covered. 

For  some  arrays,  the  only  sub-array  for  which  all  inputs 
are  known  is  the  entire  array.  The  Kung-Leisserson  LU  factor¬ 
ization  array  is  an  example.  Such  an  array  is  not  "decomposa¬ 
ble"  by  this  technique.  We  consider  that  this  is  a  serious 
drawback . 

The  Heller-Ipsen  array  (Figure  2)  is  partially  decomposa¬ 
ble.  If  we  cut  it  by  a  horizontal  line  (the  dashed  line  in 
Figure  2)  then,  since  all  data  flows  through  the  cut  from 
bottom  to  top,  we  can  "run"  the  bottom  part  of  the  array,  save 
the  output,  then  run  the  top  part  using  this  saved  data  as  in¬ 
put.  But  if  a  vertical  cut  is  made,  data  flows  across  the  cut 
in  both  directions.  Moreover,  these  data  cannot  be  known 
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unless  the  computations  performed  by  both  the  left  and  right 
halves  of  the  array  are  done. 

The  Brent-Luk  SVD  array  at  first  appears  to  be  indecom¬ 
posable.  Despite  this  we  have  recently  found  a  way  to  use 
this  array  to  solve  larger  problems.  The  key  is  to  use  a  p- 
processor  Brent-Luk  array  as  the  basic  "cell"  in  a  q-processor 
"super-array"  that  works  on  matrices  with  2pq  columns.  Colunns 
move  between  these  cells  in  groups  of  p.  The  idea  also  applies 
to  the  Brent-Luk  eigenvalue  array.  We  shall  give  the  details 
in  a  later  paper  [19]. 

VI .  MULTI GRID  METHODS 

To  illustrate  the  applicability  of  systolic-array-like 
devices  for  elliptic  problems,  we  consider  implementation  of 
multigrid  algorithms.  Full  details  are  given  by  Chan  and 
Schreiber  [7] .  Related  work  is  reported  by  Brandt  [3]  and 
Gannon  and  Van  Rosendale  [9],  Consider  a  standard  multigrid 

j 

algorithm  with  n  gridpoints  on  the  finest  grid  and  a  coarse- 
to-fine  mesh-length  ratio  of  a,  in  which  c  coarse  grid  intera- 
tions  are  done  for  every  fine  grid  iteration.  Suppose  we 
build  a  processor  grid  with  np  processors  for  every  point  grid 
..’ith  n  gridpoints,  so  that  all  computation  on  the  fine  grid 
take  0(nd  p)  time.  Then  we  can  show  that  the  time  for  an 
iteration  T(n),  satisfies 


0(nd~p) 

if 

c 

<  ad-p 

0  (nd  p  log  n) 

if 

c 

=  ad"P 

0(n^°^a  c) 

if 

c 

>  ad-p 

Thus,  we  get  a  speedup  proportional  to  the  number  of  process¬ 
ors  employed  only  in  the  first  case.  The  loss  of  efficiency 
is  not  too  bad  (O(l/log  n) )  in  the  second  case,  as  for  in¬ 
stance  when  there  is  one  processor  for  every  gridpoint  (p  =  d) 
and  we  take  c  =  1. 
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ABSTRACT 

Systolic  architectures  due  to  Brent,  Luk,  and  Van  Loan  are  today  the  most 
promising  method  for  computing  the  symmetric  eigenvalue  and  singular  value 
decompositions  in  real  time.  These  systolic  arrays,  however,  are  only  able  to  solve 
problems  of  a  given,  fixed  size.  Here  we  present  two  modified  algorithms  and  a 
modified  array  that  do  not  have  this  disadvantage.  The  results  of  a  numerical 
experiment  show  that  one  combination  of  new  algorithm  and  array  is  just  as 
efficient  as  the  Brent-Luk-Van  Loan  method 


1.  INTRODUCTION 

Systolic  arrays  are  of  significant  and  growing  importance  in  numerical  computing 
jllj,  especially  in  matrix  computation  and  its  applications  in  digital  signal 
processing  I  12  J.  There  is  now  considerable  interest  in  systolic  computation  of  the 
singular  value  decomposition  (2,4,6,  10 J  and  the  symmetric  eigenvalue  problem 

i  1.9!- 

To  date,  the  most  powerful  systolic  array  for  the  eigenvalues  of  a  symmetric 
n  X  n  matrix  is  a  square  n/2  X  nj 2  array  due  to  Brent  and  Luk.  This  array 
implements  a  certain  cyclic  Jacobi  method.  It  takes  O(n)  time  to  perform  a  sweep 
of  the  method,  and  O(logn)  sweeps  for  the  method  to  converge  [  1  j 

Brent  and  Luk  have  also  invented  a  closely  related  (n/2)-processor  linear 
array  for  computing  the  singular  value  decomposition  (SVD)  of  an  m  X  n  matrix 
A.  SVD  of  A  is  a  factorization  A  =  ULVT ,  where  V  is  orthogonal,  E  is 
nonnegative  and  diagonal,  and  U  is  m  X  n  with  orthonormal  columns.  This 
array  implements  a  cyclic  Hestenes  algorithm  that,  in  real  arithmetic,  is  an  exact 
analogue  of  their  Jacobi  method  applied  to  the  eigenproblem  for  Ar A.  The  array 
requires  O(mn)  time  for  a  sweep,  and  O(logn)  sweeps  for  convergence  ( 2  j . 

A  new  array,  very  like  the  eigenvalue  array,  is  reported  by  Brent,  Luk,  and 
Van  Loan  to  be  capable  of  finding  the  SVD  in  time  0\m  +  n  logn)  j  3). 

The  purpose  of  this  paper  is  to  consider  an  important  indeed  an  essential 
problem  concerning  the  practical  use  of  these  arrays  How,  with  an  array  of  a 
given  fixed  size,  can  we  solve  problems  of  arbitrarily  large  size’ 


2.  THE  JACOBI  AND  HESTENES  METHODS  AND  THE  BRENT-LUK  ARRAYS 


We  shall  concentrate  on  Hestenes’  method  for  the  SVD.  Starting  with  the  given 
matrix  A,  we  build  an  orthogonal  matrix  V  such  that  AV  has  orthogonal  columns 
Thus 


AV  =  U  E 


where  U  has  orthonormal  columns  and  E  is  nonnegative  and  diagonal.  An  SVD 
is  given  by  A  =  U EVr 


To  construct  V ,  we  take 


A<°’  =  A, 

and  iterate 

A('~l)  =  A(,)Q[x\  t  =  0,  1, ... 

with  orthogonal  until  some  matrix  A(l*  has  orthogonal  columns  Q ^  is 
chosen  to  be  a  product  of  n(n  —  1  )/2  plane  rotations 


n(n  —  l)/3 

Q(i)  =  n  «i,)- 

}  =  l 

Every  possible  pair  (r,s),  1  <  r  <  a  <  n,  is  associated  with  one  of  the  rotations 
(the  association  is  independent  of  t)  in  this  way:  the  rotation  Q ^  is  chosen 
to  make  columns  r  and  s  of 

a"'  n  «l" 

At  =*=  1 

orthogonal.  The  process  of  going  from  A ^  to  is  called  a  "sweep”.  Every 

permutation  of  the  set  of  pairs  corresponds  to  a  different  cyclic  Hestenes  method. 

The  correspondence  with  the  Jacobi  method  is  thiE.  The  sequence  A(,) 
converges  to  the  diagonal  matrix  Ea  of  eigenvalues  of  AT A.  Moreover 

where  *>  is  the  product  of  n(n  —  l)/2  of  Jacobi  rotations  that  zero,  in  some 

T* 

cyclic  order,  each  ofT-diagonal  element  of 

The  permutation  chosen  by  Brent  and  Luk  allows  the  rotations  to  be  applied 
in  parallel  in  groups  of  n/2.  Their  permutation  consists  of  n  —  1  groups  of  n/2 
pairs  such  that,  in  each  group,  every  column  occurs  once.  Thus,  the  n/2  rotations 
corresponding  to  a  pair-group  commute.  They  can  be  applied  in  any  order  or,  in 
fact,  in  parallel. 

The  SVD  array  is  shown  in  Figure  1  There  are  n/2  processors.  Each 
processor  holds  two  matrix  columns.  Initially  processor  »  holds  column  2*  —  1  in 
its  “left  memory”  and  column  2»  in  its  “right  memory”. 

In  each  cycle,  each  processor  computes  and  applies  to  its  two  columns  a 
plane  rotation  that  makes  them  orthogonal.  Next,  using  the  connections  shown 
in  Figure  1,  columns  move  to  neighboring  processors.  This  generates  a  new  set 
of  n/2  column-pairs. 

After  n  —  1  cycles,  n(n  —  l)/2  pairs  of  columns  have  been  generated  and 
made  orthogonal.  It  can  be  shown  (by  a  parity  a'gument)  that  no  pair  occurs 
twice  during  this  time.  Thus,  every  pair  it  generated  exactly  once. 


Figure  1  The  Brent-Luk  S\rD  array,  n  —  8 


A  diagram  (given  in  j  2  j  originally)  showing  the  movement  of  columns  through 
the  array,  very  important  in  the  considerations  to  follow,  is  given  in  Figure 
2. 


Figure  2.  Flow  of  data  in  the  S\rD  array,  n  =  8 


s 


» 


I 
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3.  SOLVING  LARGER  PROBLEMS 

We  now  consider  the  problem  of  finding  an  SVD  when  A  has  n  columns,  the  array 
has  p  processors,  and  n  >  2 p. 

The  usual  approach  to  this  problem  is  to  imagine  that  a  “virtual”  array,  large 
enough  to  solve  the  problem  (having  ff]  or  more  processors,)  is  to  be  simulated 
by  the  given,  small,  physical  array.  Moreover,  the  simulation  must  be  efficient. 
The  array  should  not  spend  a  large  amount  of  time  loading  and  unloading  data 

For  some  arrays,  this  simulation  is  trivial.  One  finds  a  subarray  of  the  virtual 
array,  of  the  same  size  as  the  physical  array,  for  which  all  the  input  streams  are 
known  Clearly  the  action  of  such  a  subarray  can  be  carried  out  and  its  outputs 
stored  These  outputs  then  become  the  inputs  to  other  subarrays.  This  process 
continues  until,  subarray  by  subarray,  the  computation  of  the  entire  virtual 
array  has  been  performed.  If  this  technique  is  possible  we  say  that  the  array  is 
“decomposable”.  The  various  matrix  multiply  arrays  [7],  the  Gentleman-Kung 
array  i  5 ],  and  the  Schreiber-Kuekes  backsolve  array  |8]  are  good  examples  of 
decomposable  arrays. 

Some  arrays  are  indecomposable:  the  Kung-Leiserson  band-matrix  LU  fac¬ 
torization  array,  for  example  |7l. 

The  Brent-Luk  arrays  are  indecomposable  Consider  Figure  2.  Suppose 
a  two-processor  array  is  available.  Can  it  simulate  the  four-processor  array?  It 
cannot,  not  efficiently  anyway,  since  there  does  not  exist  a  two-processor  segment 
of  Figure  2  for  which  only  known  data  enters.  If  this  diagram  is  cut  by  a  vertical 
line,  data  flows  across  the  line,  in  both  directions,  every  cycle  The  data  cannot 
be  known  if  only  the  computations  on  one  side  of  the  line  have  been  performed. 

Here,  we  shall  present  a  solution  to  this  problem.  The  idea  is  to  have  a 
given  p-processor  array  simulate  a  pq-processor  “superarray”  which  is  not  of  the 
Brent-Luk  type.  Moreover,  the  superarray  is  decomposable.  In  its  space-time 
dataflow  gTaph  the  processors  occur  in  groups  of  p  For  long  periods,  of  either  p 
or  2p  —  1  cycles,  there  is  no  data  flow  between  groups.  Thus,  the  physical  array 
can  efficiently  carry  out  the  computation  of  the  superarray,  groupbygroup. 

We  give  two  such  superarrays.  The  first  implements  a  Hestenes  method  in 
which  a  “sweep”  corresponds  to  a  permutation  of  a  multiset  of  offdiagonal  pairs. 
There  is  some  redundancy,  some  pairs  are  generated  and  orthogonalized  several 
times.  The  second  implements  a  cyclic  Hestenes  method  with  a  permutation 
different  from  Brent  and  Luk’s.  For  this  method,  a  minor  change  must  be  made 
to  the  array. 

We  have  compared  these  new  sweeps  to  the  Brent-Luk  sweep  These  experi¬ 
ments  indicate  that  the  first  superarray  is  about  20  —  60%  less  efficient  than  the 
Brent-Luk  array,  while  the  second  superarray  is  virtually  equal  to  the  Brent-Luk 
array  in  efficiency. 
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3.1  METHOD  A 


The  method  is  easiest  to  explain  in  terms  of  an  example.  Suppose  we  have  a  4 
processor  array.  Suppose  there  are  16  columns  in  A.  We  proceed  as  follows. 

1.  Load  columns  1-8  and  perform  a  Brent-Luk  sweep  on  them; 

2  Load  columns  9-16  and  perform  a  Brent-Luk  sweep; 

3.  Load  columns  1  -A,  13—16;  perform  a  Brent-Luk  sweep, 

4.  Load  columns  5-6,  9-12;  perform  a  Brent-Luk  sweep; 

5  Load  columns  1—4,  9-12;  perform  a  Brent-Luk  sweep; 

6.  Load  columns  13-16,  5-8;  perform  a  Brent-Luk  sweep. 

Steps  1-6  together  consistute  an  A  —  supersweep  or  AS  sweep.  During  an 
ASsweep,  every  column  pair  is  generated.  Some  are  generated  more  than  once. 

To  describe  the  general  case,  suppose  there  is  a  p  processor  array,  and  n  — 
2pq  (pad  A  with  zero  columns,  if  necessary,  so  that  2 p  divides  n).  Imagine  that 
the  matrix  A  consists  of  q  supercolumns  or  Scolumns:  supercolumn  At  consists 
of  columns 

q{i  —  1)  -h  1,  •  qi. 


Now  consider  a  q-auperprocesBor  or  Sprocessor  virtual  superarray  or  Sarray. 
Each  5processor  holds  two  Scolumns  (one  in  each  of  its  left  and  right  memories) 
In  one  supercycle  (Scycle)  the  Sprocessors  each  perform  a  single  Brent-Luk  sweep 
over  the  2p  columns  in  their  memory. 

(Obviously  we  can  simulate  an  Scycle  of  an  Sprocessor  using  one  p-processor 
Brent-Luk  array  and  2p —  1  cycles  of  time.  Moreover,  we  can  be  loading  the  data 
for  the  next  Scycle  and  unloading  the  data  from  the  preceding  Scycle  at  the  same 
time  as  we  process  the  data  for  the  current  Scycle.) 

Initially,  Scolumns  Ai  and  Aa  are  in  S  processor  1,  Aa  and  A«  in  Sprocessor 
2,  etc. 

Between  Scycle6,  the  Scolumns  move  to  neighboring  Sprocessors.  The 
scheme  for  moving  Scolumns  is  precisely  the  same  as  the  scheme  for  moving 
ordinary  columns  in  a  ^-processor  Brent-Luk  array. 

After  2 q  —  1  Scycles,  we  have  generated  every  pair  of  Scolumns  exactly  once. 
Together  these  2q—  1  S cycles  constitute  an  ASsweep.  During  an  ASsweep,  every 
pair  of  columns  of  A  is  orthogonalized.  If  two  columns  are  in  different  S columns 
then  they  are  orthogonalized  once,  during  the  Scycle  in  which  their  containing 
Scolumns  occupy  the  same  Sprocessor.  If  they  are  in  the  same  Scolumn,  then 
they  are  orthogonalized  2ty  —  1  times. 

In  units  of  cycles,  the  time  for  an  ASsweep,  Tas,  is 


Tas  =  (2 q  —  l)Scyclei  *  (2p  —  1) 
=  (2q  —  l)(2p  —  1) 


cycles 

Scycle 


(Of  course,  the  simulation  by  a  p- processor  array  takes  q  times  this  time.)  The 
time  for  a  Brent -Luk  sweep  over  n  columns,  Tbl,  i* 


Tbl  =  r»  —  1  —  2  pq  —  1. 


Thus,  the  ASsweep  takes  longer;  the  ratio  of  times  satisfies 


Tas 

Tbl 


<  2. 


(The  lower  bound  arises  in  the  simplest  nontrivial  case  p  =  q  =  2.) 

There  is  little  theoretical  basis  for  comparing  the  effectiveness  of  ASsweeps 
and  Brent-Luk  sweeps  in  reducing  the  nonorthogonality  of  the  columns  of  A. 
We  have,  therefore  performed  an  experiment.  A  set  of  square  matrices  A  whose 
elements  were  random  and  uniformly  distributed  in  [  — 1,1]  were  generated.  Both 
ASsweeps  and  Brent-Luk  sweeps  were  used  until  the  6um-of-squares  of  the  off- 
diagonal  elements  of  ATA  was  reduced  to  10“ 12  times  its  initial  value.  We  show 
the  results  in  Table  1.  The  number  of  test  matrices,  the  average  number  of 
sweeps,  the  largest  number  for  any  test  matrix,  and  the  relative  time 

_  Tas  *  average- sweeps  (AS) 

P  ~  Tbl  *  aver  age- sweeps  (BL) 


are  shown. 

Evidently  one  ASsweep  is  more  effective  in  reducing  nonorthogonality  than 
one  Brent-Luk  sweep.  This  is  not  surprising,  since  more  orthogonalisations  are 
performed.  Their  cost-effectiveness,  however,  is  roughly  20  —  60%  less. 


Averages  Maxima 


p 

q 

n 

trials 

AS 

BL 

AS 

BL 

P 

2 

2 

8 

320 

3.98 

4.33 

5 

5 

1.18 

2 

4 

16 

160 

5.10 

5.38 

6 

7 

1.33 

2 

8 

32 

80 

6.18 

6.29 

7 

7 

1.43 

4 

2 

16 

160 

4.80 

5  40 

5 

6 

1.24 

4 

4 

32 

80 

5.99 

6.31 

7 

7 

1.50 

4 

8 

64 

20 

7.05 

7.55 

8 

8 

1.57 

8 

2 

32 

80 

5.25 

6.28 

6 

7 

1.21 

8 

4 

64 

10 

6.60 

7  60 

*7 

1 

8 

1  45 

16 

2 

64 

20 

6.00 

7.30 

6 

8 

1.21 

Table  1.  Comparison  of  Break-Luk  sweeps  and  ASsweeps 
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In  order  to  g&nge  the  reliability  of  the  statistics  generated  by  this  experiment, 
we  also  measured  the  standard  deviations  of  the  sampled  data.  In  all  cases,  the 
standard  deviations  were  less  than  0.5.  For  the  samples  of  size  80  or  more,  the 
standard  errors  of  the  means  are  no  more  than  0.06,  so  these  statistics  are  quite 
reliable  For  the  samples  of  size  20  and  10,  these  data  may  be  in  error  by  as  much 
as  10%. 


3:2  METHOD  B 

Method  A  suffers  some  loss  of  speed  because,  in  an  ASsweep,  some  column-pairs 
are  generated  many  times.  By  making  a  small  modification  to  the  Brent-Luk 
array  and  using  the  new  array  as  our  basic  tool,  we  can  simulate  a  new  supersweep, 
called  an  ABSsweep,  during  which  every  column-pair  is  generated  exactly  once 
Figure  3  shows  the  modified  array.  The  connection  from  processor  1  to 
processor  p  is  new.  Note  that  a  ring  connected  set  of  processors  can  easily  simulate 
this  structure.  This  array  is  still  able  to  perform  Brent-Luk  sweeps  over  sets  of 
2p  columns.  But  it  can  also  perform  a  second  type  of  sweep,  which  we  call  an 
“Afl-sweep”  that  we  now  describe. 


Figure  3 .The  modified  SVD  array;  n=8 

In  an  AB-sweep,  a  pair  (A,  B)  of  Scolumns,  each  consisting  of  p  columns, 
is  loaded  into  the  array.  During  the  sweep,  all  pairs  (a,  6),  a  6  A,  b  6  B  are 
generated  exactly  once.  But  no  pairs  from  A  X  A  or  B  x  B  are  generated. 

To  implement  an  .AB-sweep,  place  the  columns  of  A  in  the  p  left  memories 
and  the  columns  of  B  in  the  p  right  memories  of  the  processors.  (The  set  of  left 
(resp.  right)  processor  memories  is  the  Sprocessor’s  left  (resp.  right)  memory, 
rather  than  the  memories  of  the  leftmost  (resp  rightmost)  p/2  processors) 
Processors  do  precisely  what  they  did  before:  orthogonalize  their  two  columns. 
Between  cycles,  A  remains  stationary,  while  B  rotates  one  position,  using  the 
connections  shown  as  solid  lines  in  in  Figure  3. 

An  ASSsweep  is  this.  Again  we  work  with  2 q  Scolumns  of  p  columns  each 
The  initial  configuration  is  as  for  an  ASsweep  During  the  first  Scycle,  which 
takes  2p  —  1  cycles,  every  Sprocessor  performs  a  Brent-Luk  sweep  on  the  2 p 
columns  in  its  memory.  On  subsequent  Scycles,  all  Sproce6sors  perform  AB- 
sweeps,  where  the  sets  A  and  B  are  the  two  S  columns  in  its  memory.  Between 
Scycles,  Scolumns  move  as  before. 


It  is  easy  to  see  that  in  an  ABSsweep  every  column  pair  is  generated  once. 
Thus  this  scheme  implements  a  true  cyclic  Hestenes  method.  The  permutation 
differs,  nevertheless,  from  the  Brent-Luk  permutation. 

Again,  we  have  compared  the  new  scheme  to  the  Brent-Luk  scheme  by  an 
experiment.  The  experimental  set  up  was  precisely  the  same  as  for  the  previous 
experiment.  The  results  are  shown  in  Table  2. 


Averages  Maxima 


p 

q 

n 

trials 

ABS 

BL 

ABS 

BL 

2 

2 

8 

320 

4.32 

4.33 

5 

5 

2 

4 

16 

160 

5.35 

5.38 

6 

7 

2 

8 

32 

80 

6.36 

6.29 

7 

7 

4 

2 

16 

160 

5.36 

5.40 

6 

6 

4 

4 

32 

80 

6.18 

6.31 

1 

7 

4 

8 

64 

20 

Y.50 

7.55 

8 

8 

8 

2 

32 

80 

6.13 

6.28 

7 

7 

8 

4 

64 

10 

7.10 

7.60 

8 

8 

16 

2 

64 

20 

7.00 

7.30 

7 

8 

Table  2.  Comparison  of  Brent-Luk  sweeps  and  ABSsweeps 

Evidently,  ABSsweeps  are  as  effective  as  Brent-Luk  sweeps.  The  standard  devia¬ 
tions  of  the  number  of  ABSsweeps  needed  were  also  all  less  than  0.5. 


4.  THE  EIGENVALUE  ARRAY 

Decomposition  of  the  eigenvalue  array  presents  the  same  difficulty,  and  is  amenable 
to  the  same  solution,  as  the  SVD  array.  We  need  not  present  the  details  here. 
Note,  however,  that  in  simulating  a  pqXpq  eigenvalue  array,  a  p  X  p  array  must  be 
used  to  simulate  both  diagonal  subarrays  where  its  diagonal  processors  generate 
rotations,  and  off-diagonal  subarrays,  where  all  its  ceils  only  apply  rotations.  The 
array  of  Brent,  Luk,  and  Van  Loan  for  the  SVD  can  also  be  treated  in  this 
way. 
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1.  INTRODUCTION 


In  this  paper  we  present  a  new  systolic  array  for  band  matrix  QR 
factorization.  We  then  point  out  the  relationships  between  this 
array  and  three  other  systolic  arrays:  the  hexagonal ly  connected 
LU  factorization  array  of  Rung  and  Leiserson  [4],  the  Heller-Ipsen 
array  for  QR  factorization  [3],  and  an  LU  variant  of  the  Heller- 
Ipsen  array  that  has  not  actually  appeared  anywhere  in  the  literature. 

Next,  we  show  how  to  compute  the  QR  factorization  of  a  banded 
rectangular  matrix  on  a  systolic  array.  Matrices  of  this  kind  arise 
in  least-squares  collocation  methods  for  elliptic  differential 
equations  and  some  integral  equations. 

The  results  presented  here  are  also  applicable  to  the  solution 
of  ill-posed  problems  by  regularization  methods.  The  details  of 
this  application  shall  appear  in  another  paper  [2].  In  fact,  the 
application  was  investigated  first,  by  Elden. 


1.1  Prel iminary  concepts 

Let  B  be  any  matrix.  We  say  B  is  a  (p,q)  banded  matrix  if 


b  .  . 


0  forall  i>j+p  and  all  j  >  i  +  q . 


We  let  w»1+p+q,  the  number  of  nonzero  diagonals  of  B.  When  necessary 
we  write  p(B)  (q(B),w(B))  to  distinguish  the  matrix  in  question. 


We  are  concerned  with  the  computation  of  factorizations 


(1.1) 


B  =  QR 


where  Q  is  orthogonal  and  R  upper  triangular  and  also  factorizations 
(1.2)  PB  =  LR 


where  P  is  a  permutation  matrix,  L  is  lower  triangular  with  diagonal 
elements  one,  and  R  is  upper  triangular.  The  factors  L  and  R  are  also 
banded,  with 

q(U)  <  min(n-1,  p(B)  +q(B>) 
p(L)  =  p(B) 

We  let  b.  denote  the  iC'"1 


row  of  B ,  1  f  i  <  n . 


r 


We  shall  consider  the  implementation  by  systolic  arrays  of  two 
algorithms  for  the  factorizations  (1.1)  and  (1.2).  The  algorithms 
are  alike  in  that  they  zero  every  element  below  the  main  diagonal 
of  B,  in  predetermined  sequence.  Hie  QK  algorithms  use  plane 
rotations  to  do  so;  the  LR  algorithms  use  elementary  row  operations. 
What  distinguishes  the  two  QR  algorithms  is  the  pair  of  rows  rotated 
in  zeroing  an  element;  say  by.  In  one  of  them,  rows  i~1  and  i  are 
used;  in  the  other  rows  j  and  i  art  used.  For  the  LR  algorithms  the 
distinction  is  che  same:  to  zero  by  one  subtracts  a  multiple  of 
either  row  i-1  or  row  j  from  row  i.  To  be  specific,  we  use  either 

Algorithm  1  (Central  strategy): 
for  j  :  =  1  to  n- 1 

for  i  j  +  1  to  rain(n,j+p) 

process  (b^  ,b j  ,  j) 
or 

Algorithm  2  (Neighbor  strategy) 
for  j  :  =  1  to  n-1 

for  i  :=min(n,j+p)  t£  j  +  1  step  -1 
process  (byb^.j) 

In  the  case  of  a  QR  factorization,  process  (x,£,j)  zeros  the  jth 
element  of  x  by  applying  a  plane  rotation  to  x  and  v.  In  the  case  of 
an  LR  factorization,  process  (x,^,j)  zeros  the  jC^  element  of  x  by 
subtracting  a  multiple  of  %  from  x,  after  first  exchanging  x  and  v, 
if  necessary,  to  keep  the  absolute  value  of  the  multiplier  less  than 
or  equal  to  1 . 

For  some  matrices,  the  LR  factorization  exists  and  can  be  accurately 
computed  by  the  central  strategy  without  row  interchanges.  This  is 
the  algorithm  implemented  bv  Rung  and  Leiserson  [il. 

In  discussing  parallel  processor  arrays  we  use  terms  and  conventions 
that  have  become  standard;  one  can  consult  the  review  payer  of  Brent, 
Kune,  and  Luk  [l]  for  background.  In  particular,  we  use  the  term 

"V-  to  denote  the1  fraction  of  processor  cycles  that  a  tvpical 
processor  is  actively  employed  in  an  array. 


2.  A  NEW  ARRAY  FOR  BAND  MATRIX  QR  FACTORIZATION 


We  present  here  an  array  Implementing  Algorithm  1  for  QR  factorization 
The  array,  and  the  manner  in  which  imput  data  enters,  is  shown  in 
Figure  1;  the  format  of  the  output  in  Figure  2;  the  definitions  of  the 
cells  in  Figure  3. 

The  efficiency  of  this  array  is  1/3. 

Three  identically  structured  matrices  could  be  interleaved  and 
factored  simultaneously.  In  that  use  the  efficiency  would  be  1. 

The  array  uses  (p+1 ) (q  + p/2  +  1 )  ceils  of  which  q+1  are  simply 
delay  cells  (diamonds  in  Figure  1),  p  generate  rotations  (circles 
in  Figure  1),  and  the  rest  apply  rotations  (squares  in  Figure  1). 

If  bn  enters  the  array  at  time  1,  then  b  enters  at  time  1  +  3(n-1) 

nn 

the  last  output  element,  r  ,  leaves  the  array  at  time  3n  +  2p  -  1 . 

In  its  interconnection  structure  this  array  is  the  same  as  the 
Kung-Leiserson  array  [4],  but  it  has  more  cells,  of  different  types. 

By  making  obvious  changes  to  the  circle  and  square  cells,  we  can 
create  an  LR  factorization  array.  Both  these  arrays  use  the  central 
strategy  of  Algorithm  1 . 

An  array  for  QR  factorization  using  the  neighbor  strategy  of 
Algorithm  2  was  given  by  Heller  and  Ipsen  [3].  With  a  similar  change 
to  the  cell  definitions  their  array  becomes  an  LR  factorization 
ar^iy,  too.  The  Heller-Ipsen  array  contains  pw  active  cells  and  is 
1 /2-ef f ic ient .  It  has  no  delay  cells.  The  last  output  element  emerges 
at  time  2n  + 2p  -  1 ,  so  it  is  faster  than  the  present  array. 

The  principle  advantage  of  the  present  array  is  its  applicability 
to  generalizations  of  the  problem  for  which  it  was  devised.  We 
illustrate  one  such  application  in  the  next  sections,  and  another 
in  the  paper  mentioned  earlier  [2]. 


carries  plane  rotations; 


carry  matrix  elements. 


Figure  1.  The  array  for  QR  and  LK  faciorizat ion  by  the  central  strateev 


3.  RELATED  ARRAYS 


In  this  section  we  present  two  arrays  similar  to  the  band  QR  array. 

The  first  computes  QR  factorizations  of  lower  triangular,  banded, 
rectangular  matrices.  The  second,  which  has  a  rather  unusual  structure 
in  which  two  triangular  arrays  are  interleaved,  computes  QR  factor¬ 
izations  of  block  2  *  1  matrices 


in  which  both  A  and  B  are  lower  triangular,  banded,  rectangular 
matrices.  In  the  next  section,  this  array  is  used  to  generate  QR 
fac tor izat ionsof  banded  rectangular  matrices  of  a  kind  arising  from 
integral  and  differential  equations.  With  the  obvious  changes  to  the 
cell  definitions,  all  the  results  apply  to  LR  factorization,  too 


n 

i — 1 


D 


carries  plane  rotations 


tarrv  r..ts'Lx  e.emi-nts 


Figure  4.  QR  factorization  of  a  BLTE  (1,2)  matrix  with  5  columns 


3.1  Factorization  of  lower  triangular,  banded  elongated  matrices 


Definition.  A  matrix  L  is  banded  (p),  lower  triangular,  elongate  (s) 
(BLTE  ( p , s) )  if 

1 .  Lis  n+s  x  n  , 


By  2,  for  any 

row  i  >  n  +  p , 

every  el 

that  0  <  s  <  p . 

For  example. 

if  L  is 

X 

X 

0  ' 

L 

= 

X 

X 

^  X 

.0 

X 

x'  X 

emen  t  ? . . 

ij 

BLTF.  (3,2) 


=  0;  thus 
then 


we  may  assume 


In  Figure  4,  we  illustrate  a  systolic  array  for  QR  factorization  of  a 
BLTE  (p,s)  matrix  where  p  =  3  and  s=2.  In  general  the  array  is  ( p 1 ) *(p+ 1 ) 
and  triangular  with  horizontal,  vertical,  and  diagonal  connections. 

In  fact,  it  is  the  array  of  Section  2,  specialized  to  the  case  q  =  0. 

Note  that  the  input  format  is  the  same  as  for  a  square  banded 
matrix,  although  its  extent  in  time-space  is  different.  Efficiency  is 
1/3.  The  output,  an  n*n  square,  upper  triangular  (o,p)  banded  matrix, 
emerges  from  the  top  cells  in  the  format  shown  in  Figure  2. 


3.2  Factorization  of  block  2*1  matrices  with  BLTE  blocks 
Here  we  consider  factorization  of  matrices 

A 
B 

where  A  is  BLTE  (P^>SA)>  B  is  BLTE  (pB,s^),  and 
PB-PA  or  pA-l 

Before  presenting  the  array,  we  sketch  the  algorithm.  Let  us  take  a 
specif ic  case,  =  Pg  =  2  ,  sa  =  SB  =  1  ’  ^  or  iHustrati°n"  We  must  factor 
a  matrix  whose  form  is 


1 

J 


The  strategy  is  this.  We  eliminate  elements  of  A  m  the  lower  triangle 
by  a  central  strategy.  There  are  two  such  elements  in  a  typical  column. 
Let  them  be  eliminated  at  times  2  and  4.  There  are  three  elements  to  be 
eliminated  in  the  corresponding  column,  say  column  j,  of  B.  Let  these 
be  eliminated,  by  rotations  involving  row  j  of  A,  at  times  1,  3,  and  5. 

This  strategy  causes  no  unnecessary  nonzero  elements  to  be  created 
in  either  A  or  B.  When  eliminating  b.  .  (0  < r < sD)  row  j+r  of  B  has 

nonzeros  in  columns  j,  j  +  1,  ...  ,j+r.  At  the  same  time,  row  j  of  A  has 

precisely  the  same  structure.  So  there  is  no  "fill-in",  no  new  nonzero 
is  created.  When  a.  .  (1  <  r <  s.)  goes,  row  j  gets  a  nonzero  element 
in  column  j+r.  These  are  the  only  fill-ins. 

The  array  is  shown  in  Figure  5.  It  consists  of  two  interleaved 
triangular  arrays.  One  of  these,  the  A  array,  is  just  the  array  of 
Section  3.1.  It  includes  those  cells  in  odd-numbered  columns  of  the 

array.  They  operate  only  on  elements  of  A,  reducing  it  to  an  upper 

triangular  (0,  pA>  banded  matrix.  In  the  other  array,  which  occupies 
the  even-numbered  columns,  plane  rotations  are  applied  to  pairs  of 
rows,  a  row  from  A  and  a  row  from  B,  to  zero  ail  the  elements  of  B. 


In  Figure  5  elements  of  A  are  shown  by  their  indices,  elements  of  B 


are  shown  as  b...  S..  (R.  .)  represents  the  rotation  that  has  zeroed 

ij  ij  ij  K 


b..  (a..).  The  case  p .  *  p  =  3  is  shown .  If  p„  were  2 ,  we  would  el imina  te 
ij  ij  rA  rB 


the  top  row  of  the  B-array.  If  p_  <  p.  -  1  (or  p_  >  p.)  we  must  include 

a  A  d  A 


zero  diagonals  in  the  band  of  B  (or  A)  to  bring  pD  up  to  p,  -  1  (or 

D  A 


PA  up  to  Pg) ,  otherwise  the  array  will  not  work  correctly. 


The  efficiency  is  1/4.  Cells  active  at  the  time  shown  are  emphasized. 


The  circled  indices  in  Figure  5  represent  elements  of  the  upper  triangle 
of  A.  When  leaving  the  array  at  the  top,  they  are  the  elements  of  R. 


-10- 


4.  FACTORIZATION  OF  BANDED  RECTANGULAR  MATRICES 

In  least  squares  collocation  method  for  ordinary  and  elliptic  boundary 
value  problems  or  certain  integral  equations,  for  example 
1 

J  k(x,y)u(y)dy  =  g(x) 

0 

where  k(x,y)  =  0  if  x-y  <  d  <  1  ,  we  may  encounter  rectangular  banded 
matrices.  Thus  we  consider  QR  factorization  of  m*n  matrices  A  where, 
for  some  integer  p  >  1  , 

(4.1)  m  =  1  +  e(n- 1 ) . 


To  generalize  the  notion  of  a  diagonal,  we  define,  for  1  <  j  <n 


r(j)  *  1  +  P ( j  —  I ) 


and  consider  the  elements  a  ...  .to  be  the  main  diagonal.  Then  we  let 
c(i)  s  r-1(i)  -  1  +  [ ( i-1 ) /p] . 


We  say  that  A  is  (p,q)  banded  if 

(4.2)  a ^  =  0  unless  c(i)-p<j<c(i)+q. 

For  exampl  e,  when  p  =  3 ,  p=q  =  1,  and  n  =  4 ,  A  has  the  form 


The  first  step  in  the  systolic  factorization  reorders  the  rows  of  A, 
so  that  after  reording, 


-1 1- 


L 


R  ,  is  the  desired  factor  of  A:  the  rotations  that  constitute  Q  are 
o-l 

the  rotations  that  consitute  Q.,  1  <  i<j.  (We  shall  denote  elements  of 

R.  bv  ,<») 
i  1  k  i 


To  compute  the  factorizations  (4.3)  and  (4.4)  we  use  the  array  of 
Section  3.2.  It  is  first  necessary  to  put  the  matrices  into  the  form 
required:  block  2X 1  in  which  both  blocks  are  BLTE.  To  do  so,  we  add 
zero  rows.  For  (4.3),  we  add 
q  zero  rows  to  Ao  and 
q  zero  rows  to  Ai,  so  that 
Ac  becomes  BLTE  (p+q,q)  and 

Ai  becomes  BLTE  (p+q-1 ,q-1)  .  Thus  tin  t ac tor iza t ion  can  be  done  by 
a  (p+q+1)  *  (p+q+1)  triangular  A  array  with  an  embedded  B  array  ot 
size  (p+q)  x  (p+q) . 
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For  (4. A),  we  note  that  Rj_j  is  (0,  p+q)  banded.  So  we  include 
p+q  zero  rows  on  top  to  make  it  BLTE  (p+q, p+q).  is  treated  like 
Aj.  The  same  array  can  handle  the  work. 

Because  elements  of  the  matrices  enter  every  4  clocks,  it  takes 

4n  + 3(p+q)  +0(1)  clocks  to  complete  any  of  the  factorizations  (4.3) 

and  (4.4);  3(p+q)  is  the  length  of  the  longest  path  through  the  array. 

The  overall  factorization  need  not  take  p-1  times  as  long,  since  the 

factorizations  (4.4)  can  be  performed  in  a  pipelined  manner.  In  fact, 

if  we  say  that  a  factorization  (4.4)  begins  when  r^  * ^  enters  the 

array  (it  is  the  first  element  of  R.  .  to  do  so)  then  we  can  begin  the 

l”1  (2) 

next  such  factorization  1  +p  +  q  clocks  later,  as  this  is  when  r^ 
leaves.  Thus,  if  enough  hardware  is  available,  the  entire  factorization 
can  be  obtained  in  4n  ♦  (p+2) (p+q)  +0(1)  clocks. 

To  achieve  this  throughout,  we  need  hardware  to  work  on  roughly 
min(p,  4n/ (p+q) ) 

factorizations  at  a  time,  the  second  term  being  the  ratio  of  the  time 
for  a  factorization  to  the  interval  at  which  factorizations  can  begin. 

A  single  array  can  work  on  4  (interleaved)  factorizations  at  once. 

Thus,  min(p/4,  n/(p+q))  arrays  of  (p+q)‘  cells  can  be  used.  These 
designs  are  fully  efficient  when  4  or  more  factorizations  are  simul- 
tanously  performed. 
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Abstract 

Systolic  architectures  due  to  Brent,  Luk,  and  Van  Loan  are  today  the  most  promising  idea 
for  computing  the  symmetric  eigenvalue  and  singular  value  decompositions  in  real  time. 

These  systolic  arrays,  however,  are  only  able  to  solve  problems  of  a  given,  fixed  size. 

Here  we  present  two  modified  algorithms  and  a  modified  array  that  do  not  have  this  disadvan¬ 
tage  .  The  results  of  a  numerical  experiment  show  that  a  combination  of  one  of  the  new 
algorithms  and  the  new  array  is  just  as  efficient  as  the  Brent- Luk-Van  Loan  method. 

Introduction 

Systolic  arrays  are  of  significant  and  growing  importance  in  numerical  computing,  espe¬ 
cially  in  matrix  computation  and  its  applications  to  digital  signal  processing  1 >*  There  is 
now  considerable  interest  in  systolic  computation  of  the  singular  value  decomposition* * " »** * 
and  the  symmetric  eigenvalue  problem.*'?  These  decompositions  are  needed  to  implement  some 
recently  advocated  methods  in  signal  processing.*'19 

To  date,  the  most  powerful  systolic  array  for  the  eigenvalues  of  a  symmetric  n  x  n 
matrix  is  a  square  n/2  x  n/2  array  due  to  Brent  and  Luk.  This  array  implements  a  certain 
cyclic  Jacobi  method.  It  take  0(n)  time  to  perform  a  sweep  of  the  method  and  O(log  n) 
sweeps  for  the  method  to  converge.*  A  very  similar  array  is  used  by  Brent,  Luk,  and  Van 
Loan  to  compute  the  singular  value  decomposition  (SVD)  of  an  m  x  n  matrix  in  time  m  + 

0(n  log  n).11  Brent  and  Luk  had  previously  described  an  n/2-procesaor  linear  array  that 
requires  0 (mn  log  n)  time  for  the  SVD.1 

This  paper  deals  with  the  important  practical  issue  of  how,  with  an  array  of  a  given 
fixed  size,  we  can  solve  problems  of  arbitrary  size. 

The  Jacobi  and  Hestenes  methods  and  the  Brent-Luk  arrays 


We  shall  concentrate  on  Hestenes1  method  for  the  SVD.  Let  A  be  a  given  m  x  n  matrix. 

An  SVD  of  A  is  a  factorization  A  -  urvt  where  V  is  orthogonal,  E  is  nonnegative  and  diag¬ 
onal,  and  U  has  orthonormal  columns.  The  method  starts  with  the  given  matrix  A  and  builds 
an  orthogonal  matrix  V  such  that  AV  has  orthogonal  columns.  An  SVD  is  obtained  by  normal¬ 
izing  the  columns  of  AV. 

The  orthogonal  transformation  V  consists  of  a  sequence  of  plane  rotations.  Each  rotation 
orthogonalizes  a  pair  of  columns  of  A.  Column  pairs  are  orthogonalized  one  at  a  time  until 
all  pairs  have  been  orthogonalized.  This  constitutes  one  sweep  of  the  method.  The  order 
in  which  pairs  are  orthogonalized  is  fixed.  The  method  is  equivalent,  in  real  arithmetic, 
to  a  cyclic  Jacobi  method  for  the  eigenvalues  of  A^A. 

The  order  chosen  by  Brent  and  Luk  allows  the  rotations  to  be  applied,  in  parallel,  in 
groups  of  n/2.  The  linear  array  that  does  this  is  shown  in  Figure  1,  There  are  n/2  pro¬ 
cessors.  Each  processor  holds  two  matrix  columns.  Initially  processor  i  holds  column  2i-l 
in  its  "left"  memory  and  column  2i  in  its  "right”  memory.  In  every  cycle,  every  processor 
applies  a  plane  rotation  to  the  columns  in  its  two  memories.  The  processors  choose  these 
plane  rotations  to  make  the  resulting  columns  orthgonal.  Then,  columns  move  to  adjacent 
processors  using  the  connections  shown  in  Figure  1.  This  generates  a  new  set  of  n/2  pairs 
of  coluams. 

After  n-1  cycles,  n(n-l)/2  pairs  of  columns  have  been  generated  and  made  orthogonal,  it 
can  be  shown  that  no  pair  occurs  twice.  Thus  every  pair  is  generated  exactly  once. 


Figure  1.  The  Brent-Luk  SVD  array;  n-8 


Figure  2.  Flow  of  data  in  tha  SVD  array;  n  «  8 
Solving  larger  problaaa 


we  now  conaidar  tha  problem  of  finding  tha  SVD  of  A  whan  A  has  n  columns,  tha  array  has 
p  processors,  and  n>  2p.  Tha  usual  approach  to  this  problam  is  to  imagine  that  a  "virtual" 
array  large  enough  to  solve  tha  problam  (which  means ,  in  this  ease,  that  tha  array  has  at 
least  n/2  procassors)  is  to  be  simulated  by  the  smaller  array  which  is  actually  available. 
Moreover,  tha  simulation  must  ba  efficient.  Tha  array  should  not  spend  a  lot  of  time  load¬ 
ing  and  unloading  data. 


For  soma  arrays  this  simulation  is  trivial.  One  finds  a  subarray  of  tha  virtual  array 
that  is  tha  same  size  as  tha  physical  array  and  for  which  all  tha  input  streams  are  known. 
Tha  computation  of  such  a  subarray  can  ba  carried  out  and  its  outputs  stored.  These  out¬ 
puts  than  become  tha  inputs  to  other  subarrays.  This  process  continues  until,  subarray 
by  subarray,  tha  computation  of  tha  entire  virtual  array  has  bean  performed.  If  this  tech¬ 
nique  is  possible,  we  say  that  the  array  is  "decomposable* .  The  various  matrix  multiplica¬ 
tion  arrays,11  the  Gentleman-Rung  array,11  and  the  Schraiber-Kuekes  backsolve  array1"  are 
good  examples  of  decomposable  arrays.  Sosm  arrays  are  indecomposable s  the  Kung-Leiserson 
band  matrix  LO  factorization  array,11  for  example. 


The  Brent-Luk  arrays  are  indecomposable.  Consider  Figure  2.  Suppose  a  two-processor 
array  is  available.  Can  it  simulate  a  four-processor  array?  It  cannot,  not  efficiently, 
anyway,  since  there  does  not  exist  a  two-processor  segment  of  Figure  2  for  which  only  known 
data  enters.  If  this  diagram  is  cut  by  a  vertical  line,  data  flows  across  the  line,  in 
both  directions,  every  cycle.  The  entering  data  cannot  be  supplied  if  only  the  computation 
on  one  side  of  the  line  is  performed. 


We  shall  present  a  solution  to  this  problem.  The  idea  is  to  have  the  p-processor  array 
simulate  a  pq-processor  "superarray"  which  is  not  of  the  Brent- Luk  type.  The  superarray  is 
decomposable  into  p-processor  arrays.  If  one  were  to  make  a  diagram  like  that  of  Figure  2, 
showing  the  flow  of  data  in  space  and  time  in  this  superarray,  the  processors  would  occur 
in  groups  of  p,  and  for  rather  long  periods  there  would  be  no  data  flowing  between  the 
groups.  Thus  the  physical  array  can  efficiently  carray  out  the  computation  of  the  super¬ 
array,  group  by  group. 


We  give  two  such  superarrays.  The  first  implements  a  method  in  which,  during  a  sweep, 
some  pairs  of  columns  are  orthogonalized  more  than  once.  The  second  implements  a  cyclic 
Hestenes  method,  but  the  order  in  which  pairs  are  orthogonalized  is  different  from  the 


order  used  by  Brent  and  Luk.  In  order  to  implement  this  method,  a  slight  modification  of 
the  Brent- Luk  array  is  required. 

we  have  compared  these  new  sweep  orders  to  the  one  used  by  Brent  and  Luk.  These  exper¬ 
iments  indicate  that  the  first  superarray  is  less  efficient,  by  between  20%  and  60%,  than 
the  Brent-Luk  array,  while  the  second  is  as  efficient  as  the  Brent-Luk  array. 

Method  A 

The  method  is  easiest  to  explain  by  giving  an  example.  Suppose  we  have  a  4-processor 
array.  Suppose  there  are  16  columns  in  A.  We  proceed  as  follows: 

1.  Load  columns  1-8  and  perform  a  Brent  Luk  sweep; 

2.  Load  columns  9-16  and  perform  a  Brent-Luk  sweep; 

3.  Load  columns  1-4,  13-16;  perform  a  Brent-Luk  sweep; 

4.  Load  columns  5-8,  9-12;  perform  a  Brent-Luk  sweep; 

5.  Load  columns  1-4,  9-12;  perform  a  Brent-Luk  sweep; 

6.  Load  columns  13-16,  5-8;  perform  a  Brent-Luk  sweep. 


Steps  1-6  together  constitute  an  A-supersweep  or  ASsweep.  During  an  ASsweep  every 
column  pair  is  generated.  Some  are  generated  more  than  once. 

To  describe  the  general  case,  suppose  there  is  a  p  processor  array  and  that  n  ■  2pq. 

(Pad  A  with  zero  columns,  if  necessary,  so  that  2p  divides  n.)  Imagine  that  the  matrix  A 
consists  of  q  supercolumns  or  Scolumns.  Scolumn  A^  consists  of  columns  q(i-l)+l,  .  .  . ,  qi 

of  A.  Now  consider  a  q-superprocessor  (or  Sprocessor)  virtual  superarray  (or  Sarray) . 

Each  Sprocessor  holds  two  Scolumns,  one  in  its  left  and  one  in  its  right  memory.  In  one 
supercycle  (or  Scycle)  each  Sprocessor  performs  a  single  Brent-Luk  sweep  over  the  2p 
columns  in  its  memories.  Obviously  a  p-processor  Brent-Luk  array  can  implement  an  Scycle 
of  an  Sprocessor  during  2p-l  of  its  own  cycles.  Moreover,  it  can  load  data  for  the  next 
Scycle  and  unload  data  from  the  preceding  Scycle  at  the  same  time. 

Initially  Scolumns  At  and  A*  are  in  Sprocessor  1,  Aj  and  Ak  in  Sprocessor  2,  etc.  Be¬ 
tween  S cycles,  the  Scolumns  move  to  neighboring  Sprocessors.  The  scheme  for  moving  Scol¬ 
umns  is  precisely  the  same  as  the  scheme  for  moving  ordinary  columns  in  a  q-processor 
Brent-Luk  array. 

After  2q-l  Scydes,  every  pair  of  Scolumns  has  been  generated  exactly  once.  Together 
these  2q-l  Scycles  constitute  an  ASsweep.  During  an  ASsweep,  every  pair  of  columns  of  A  is 
orthogonalized.  If  two  columns  are  in  different  Scolumns,  then  they  are  orthogonalized 
once,  during  the  Scycle  in  which  their  containing  Scolumns  occupy  the  same  Sprocessor.  If 
they  are  in  the  same  Scolumn,  then  they  are  orthogonalized  2q-l  times. 

In  units  of  ordinary  processor  cycles,  the  time  for  an  ASsweep,  T^,  is 


TAS  *  (2p-l)  cycles. 

The  time  for  a  Brent-Luk  sweep  over  n  columns,  T^,  is 
TBL  "  n*1  "  2^~1- 

Thus,  the  ASsweep  takes  longer;  the  ration  of  times  satisfies 


2. 


(The  lower  bound  arises  in  the  simplest  nontrivial  ease,  p  »  q  »  2.) 

There  is  little  theoretical  basis  for  comparing  the  effectiveness  of  ASsweeps  and  Brent- 
Luk  sweeps  in  reducing  the  nonorthogonality  of  columns  of  A.  We  have,  therefore,  performed 
an  experiment.  A  set  of  square  matrices  A  whose  elements  were  random  and  uniformly  distri¬ 
buted  in  (-1.11  were  generated.  Both  ASsweeps  and  Brent-Luk  sweeps  were  used  until  the  sum 
of  the  squares  of  the  off-diagonal  elements  of  A*A  was  reduced  to  10“ 12  times  its  original 
value.  We  show  the  results  in  Table  1.  The  number  of  test  matrices,  the  average  number  of 
sweeps,  the  largest  number  for  any  test  matrix,  and  the  relative  time 


*  average-sweeps  (AS) 

*  average-sweeps  (BL) 


are  shown. 
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Table  1. 


J2_ 

g 

n 

trials 

Averages 

AS  BL 

Maxima 

AS  BL 

0 

8 

320  " 

3.98 

4.33 

5 

1.19 

2 

4 

16 

160 

5.10 

5.38 

6 

7 

1.33 

2 

8 

32 

30 

6.18 

6.29 

7 

7 

1.43 

4 

2 

16 

160 

4.80 

5.40 

5 

6 

1.24 

4 

4 

32 

30 

5.99 

6.31 

7 

7 

1.50 

4 

8 

64 

20 

7.05 

7.55 

3 

8 

1.57 

a 

2 

32 

30 

5.25 

6.28 

6 

7 

1.21 

8 

4 

64 

10 

6.60 

7.60 

7 

8 

1.45 

16 

2 

64 

20 

6.00 

7.30 

6 

8 

1.21 

Evidently  on*  ASsweep  is  more  effective  in  reducing  nonorthogonality  than  on*  Brent-Luk 
sweep.  This  is  not  surprising,  since  more  orthogonal! zations  are  performed.  They  are, 
however,  roughly  20>€0%  less  cost-effective. 


In  order  to  gauge  the  reliablility  of  the  statistics  generated  by  this  experiment,  the 
standard  deviations  of  the  sampled  data  were  measured.  Zn  all  cases,  the  standard  devia¬ 
tion  was  less  than  0.5.  For  the  samples  of  size  30  or  more,  the  standard  error  of  the  mean 
is  therefor*  no  more  than  0.06,  so  these  statistics  are  quite  reliable.  For  the  samples  of 
size  20  and  10,  the  data  are  unlikely  to  be  in  error  by  more  than  10% 


Method  B 


Method  A  suffers  some  loss  of  speed  because,  in  an  ASswaep,  some  column  pairs  are  gene¬ 
rated  more  than  once.  By  making  a  small  modification  to  the* Brent-Z.uk  array,  we  can  simu¬ 
late  a  new  supersweep,  called  an  ABSsweep,  during  which  every  column  pair  is  generated 
exactly  once . 


Figure  3  shows  the  modified  array.  The  connection  from  processor  1  to  processor  p  is 
new.  a  ring-connected  set  of  processors  can  easily  simulate  this  structure.  The  array  is 
still  able  to  perform  a  Brent-Luk  sweep  over  a  sat  of  2p  columns.  But  it  can  also  perform 
a  second  type  of  sweep,  called  an  ABsweep,  that  w*  now  describe. 


Figure  3.  The  modified  SVD  array;  n  *  8 

Zn  an  ABsweep,  a  pair  (A,B)  of  Scolumns  (each  having  p  columns)  is  loaded  into  the  array 
During  the  sweep,  all  pairs  (a,b) ,  a  e  A  and  b  e  B  are  generated  exactly  once.  But  no 
pairs  from  A*A  or  B»B  are  generated. 

To  implement  an  ABsweep,  place  the  columns  of  A  in  the  p  left  memories  and  the  columns 
of  B  in  the  p  right  memories  of  the  individual  processors.  (The  set  of  left  (resp.  right) 
processor  memories  is  the  Sprocessor's  left  (resp.  right)  memory.)  Processors  do  just  what 
they  did  before:  orthogonaliz*  their  two  columns.  Between  cycles,  A  remains  stationary , 
while  B  rotates  on*  position  using  the  connections  shown  as  solid  lines  in  Figure  3. 

An  ABSsweep  is  this.  There  are  2q  Scolumns  of  p  columns  each.  The  initial  configura¬ 
tion  is  as  for  an  ASswaep.  During  the  first  Scycle,  every  Sprocessor  performs  a  Brent-Luk 
sweep  on  the  2p  columns  in  its  memory.  On  subsequent  Scycles,  All  Sprocessors  perform 
ABsweep*.  Between  Scycles,  Scolumns  move  as  before. 

Zt  is  easy  to  see  that  in  an  ABSsweep  every  column  pair  is  generated  exactly  once.  Thus 
this  scheme  implements  a  true  cyclic  Hestenes  method.  The  permutation  differs,  neverthe¬ 
less  from  that  used  by  a  Brent-Luk  array. 

Again  w*  have  compared  the  new  scheme  to  the  Brent-Luk  scheme  by  an  experiment.  The 
experimental  set  up  was  the  same  as  for  the  previous  experiment.  The  results  are  shown  in 
Table  2.  Evidently  ABSsweeps  are  as  effective  as  Brent-Luk  sweeps.  The  standard  deviation 
of  the  number  of  ABSsweeps  needed  was,  in  all  cases,  less  than  0,5. 
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Tab  la  2.  Comparison  of  Brant-Luk  wwpt  and  ABSsweeps 


q 

n 

trials 

Averages 

ABS  BL 

Maxima 

ABS  BL 

2^ 

3 

320 

4.32 

4.33 

5 

2 

4 

16 

160 

S .  35 

5.38 

6 

7 

2 

8 

32 

80 

6.36 

6.29 

7 

7 

4 

2 

16 

160 

5.36 

5.40 

6 

6 

4 

4 

32 

80 

6.18 

6.31 

7 

7 

4 

8 

64 

20 

7.50 

7.55 

8 

9 

8 

2 

32 

80 

6.13 

6.23 

7 

7 

8 

4 

64 

10 

7.10 

7.60 

a 

3 

16 

2 

64 

20 

7.00 

7.30 

7 

8 

Tha  eigenvalue  array 

Decomposition  of  the  Brent- Luk  eigenvalue  array  presents  the  same  difficulty,  and  is 
amenable  to  the  same  solution,  as  the  SVD  array.  Tha  array  of  Brent,  Luk,  and  Van  Loan 
for  the  SVD  can  also  be  treated  in  this  way,  but  there  is  as  yet  no  experimental  evidence 
to  suggest  how  effective  this  technique  would  be. 
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